Xilinx / Vitis-In-Depth-Tutorial


Another question about wide_vadd sample #16

Open boxerab opened 3 years ago

boxerab commented 3 years ago

@rwarmstr, got another question if you have time: I profiled the wide_vadd tutorial with a small modification to add floats instead of integers. I also used 4 concurrent kernels.

Environment: U250, 48-core CPU, Ubuntu 18, latest XRT.

Timing for a single run of adding two buffers of size 810241024 ints was 24 ms. This seems like a long time to simply add two buffers together - is there anything else I can do to speed things up? For example, is there a way of harnessing the PLRAM on the board to make it faster?

Any insights here would be really appreciated.

rwarmstr commented 3 years ago

I'm not sure from your comment, since you mention you converted from int to float and then later refer to ints, but do keep in mind that prior to the Versal architecture, FPGA fabric does not natively support floating-point types. As a result we need to add logic around the operations, and that incurs a latency penalty.
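For concreteness, a minimal sketch of what a helper like vadd_as_float() might look like (a reconstruction; the thread never shows the actual function), assuming 16 packed fp32 lanes per 512-bit word:

#include <ap_int.h>
#include <stdint.h>

typedef ap_uint<512> uint512_dt;

// Hypothetical lane-wise float add over a 512-bit word (16 x fp32).
// On pre-Versal fabric each float add is built from LUT/DSP logic and
// carries several cycles of latency - the penalty described above.
uint512_dt vadd_as_float(uint512_dt a, uint512_dt b)
{
#pragma HLS INLINE
    uint512_dt result;
    for (int i = 0; i < 16; i++) {
#pragma HLS UNROLL
        union { uint32_t u; float f; } la, lb, ls;
        la.u = a.range(32 * i + 31, 32 * i).to_uint();
        lb.u = b.range(32 * i + 31, 32 * i).to_uint();
        ls.f = la.f + lb.f; // the multi-cycle floating-point add
        result.range(32 * i + 31, 32 * i) = ls.u;
    }
    return result;
}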

But, looking at the numbers, that still doesn't make sense: 810241024 samples divided by 24 ms equates to around 7 clock cycles per add, which seems too high, and that also assumes that nothing is being done in parallel by the kernels.

Can you check a few things?

rwarmstr commented 3 years ago

Actually, sorry, I just realized you probably meant 8 * 1024 * 1024, or ~8M samples. That number works out about right to 1 sample per clock if you aren't taking advantage of parallelism in the kernel. So maybe you missed a pragma somewhere?
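(Sanity-checking that: 8 * 1024 * 1024 = 8,388,608 adds in 24 ms is ~350M adds/s, which at a kernel clock around 300 MHz is roughly one add per clock - i.e. none of the 16 lanes of the 512-bit bus working in parallel. The 300 MHz figure is an assumption; the thread doesn't state the clock.)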

boxerab commented 3 years ago

> Actually, sorry, I just realized you probably meant 8 * 1024 * 1024, or ~8M samples. That number works out about right to 1 sample per clock if you aren't taking advantage of parallelism in the kernel. So maybe you missed a pragma somewhere?

Thanks very much, Rob. Yes, the sample amounts to essentially transferring two HD RGBA video frames, adding each pixel, then transferring back to the host. I switched to using a single kernel, and the timing is around 53 ms. So I must be missing something, because this is a very long time to perform such a simple op.
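(For scale: an HD RGBA frame is 1920 × 1080 × 4 ≈ 8.3M values, so each 8 * 1024 * 1024 float buffer is roughly one frame's worth of data, about 32 MB.)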

My code is almost identical to your wide_vadd sample, except I use float operations instead of int. Both the input and output buffers for each kernel use a single DDR channel on the U250.

Here is the code:

https://github.com/GrokImageCompression/latke/tree/master/tests/wide_vadd

and there is a README file describing how I set up and built the kernel.

I am attaching the Vitis analyzer files, if you have time to take a look.

vitis_analyzer.zip

Thanks again!

boxerab commented 3 years ago

Looking at the Vitis guidance, it looks like the kernel write transfer size is only 64 bits, so it is not using the full 512-bit bus.

rwarmstr commented 3 years ago

It looks like the cause of the issue is that you're assigning the output value directly from the vadd_as_float() function, which implies that the kernel has to receive the AXI write response before it can process the next value. So it looks like it's working as coded/intended, but you're spending most of your time in pipeline stalls.
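In code terms, the pattern being described is presumably something like this (a reconstruction; the exact line isn't quoted in the thread):

    // out[i + j] is passed by reference, so each loop iteration stalls
    // until the AXI write response for its store comes back:
    vadd_as_float(v1_local[j], v2_local[j], out[i + j]);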

You can analyze that through the HLS analysis view, but the quick fix (note that I haven't had a chance to try it myself yet) would be to break this into three main blocks: read, compute, and write. The write block should pull from the previously declared result_tmp array. As it stands, you're passing in out[i+j] by reference, and that really slows things down.

Even doing something like:

    uint512_dt tmp;
    vadd_as_float(a, b, tmp); // compute into a local temporary
    out[i + j] = tmp;         // then do a single store to global memory

improves the scheduling by a lot, at least in code analysis. I'm not 100% sure how that would translate to runtime efficiency without trying it, but the approach I outlined above should work.

boxerab commented 3 years ago

Thanks for taking a look! I have made these changes, so the kernel now does a burst read to local buffers, then the compute, then a burst write back to global memory. The Vitis guidance on this kernel looks better. However, the timing is unchanged: 53 ms for a single kernel, and 23 ms for 4 concurrent kernels. So I must still be missing something...

I'm attaching the new vitis_analyzer outputs in case you have time to take a peek.

Thanks again!

vitis_analyzer.zip

rwarmstr commented 3 years ago

There are two big things I see here in hardware emulation. One is that your FIFOs are drastically overprovisioned (occupancy doesn't exceed one in cosim; I set the depth to 4 just to be safe). The other is that the processes are blocking about 66% of the time. That's mostly due to the way you have a dataflow region inside nested for loops. It's a bit of an older practice (and I realize the example does this). It doesn't hurt the Get Moving with Alveo code, but in your case, since the int-to-float conversion has a longer latency, the kernel basically reads 64 chunks, waits until all of the processing is done, reads another 64 chunks, waits, and so on. You really want it to just read as fast as it can.

Try simplifying it to something like this:

void wide_vadd(
        const uint512_dt *in1, // Read-Only Vector 1
        const uint512_dt *in2, // Read-Only Vector 2
        uint512_dt *out,       // Output Result
        int size               // Size in 32-bit elements
    )
    {
#pragma HLS INTERFACE m_axi depth=8192 port=in1 max_read_burst_length=32 offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi depth=8192 port=in2 max_read_burst_length=32 offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi depth=8192 port=out max_write_burst_length=32 max_read_burst_length=32 offset=slave bundle=gmem2
#pragma HLS INTERFACE s_axilite port=in1 bundle=control
#pragma HLS INTERFACE s_axilite port=in2 bundle=control
#pragma HLS INTERFACE s_axilite port=out bundle=control
#pragma HLS INTERFACE s_axilite port=size bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    uint512_dt v1_local[BUFFER_SIZE];     // local memory to store vector 1
    uint512_dt v2_local[BUFFER_SIZE];     // local memory to store vector 2
    uint512_dt result_local[BUFFER_SIZE]; // local memory to store the result

#pragma HLS stream variable=v1_local depth=4
#pragma HLS stream variable=v2_local depth=4
#pragma HLS stream variable=result_local depth=4

    // The size argument counts 32-bit elements, but the kernel accesses
    // 512-bit words (16 elements each), so the number of reads from
    // global memory is calculated here:
    int size_in16 = (size - 1) / VECTOR_SIZE + 1;

#pragma HLS DATAFLOW

    // burst read both input vectors into local memory
    v1_rd:
        for (int j = 0; j < size_in16; j++) {
#pragma HLS pipeline
            v1_local[j] = in1[j];
            v2_local[j] = in2[j];
        }

    // perform the vector addition
    v2_add:
        for (int j = 0; j < size_in16; j++) {
#pragma HLS pipeline
            auto tmpV1 = v1_local[j];
            auto tmpV2 = v2_local[j];
            result_local[j] = vadd_as_float(tmpV1, tmpV2);
        }

    // burst write the result to global memory
    v1_write:
        for (int j = 0; j < size_in16; j++) {
#pragma HLS pipeline
            out[j] = result_local[j];
        }
}

rwarmstr commented 3 years ago

With the code above, resource utilization is significantly lower (down to 90 BRAMs, 32 DSPs, ~17k FFs), and in cosimulation I see we hit approximately line rate. HW should be roughly the same, maybe slightly lower given the external DDR latency. So you should be able to process all 8M floats in ~2 ms (1.7 by my back-of-the-envelope calculation).
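(The back-of-the-envelope, again assuming a ~300 MHz kernel clock: 8,388,608 floats ÷ 16 lanes per 512-bit word = 524,288 words; at one word per clock that is 524,288 cycles ≈ 1.7 ms.)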

I'll also update the tutorial code to use this coding style since the new Vitis HLS front-end does better with it.

boxerab commented 3 years ago

Thanks again, Rob! Really appreciate your help. I would love to get the time down to ~2 ms.

I built the kernel with your changes, and the timing did go down, but it is still high. The kernel took ~23 ms, and the entire frame, including PCIe transfers, took 40 ms.

Is there anything else we can try for this kernel?

I've checked my changes into my Latke project, btw.

boxerab commented 3 years ago

Here are my Vitis analyzer files. By the way, I built this kernel with the Vitis 2020.2 release.

vitis_analyzer.zip

rwarmstr commented 3 years ago

This is bandwidth-bound; there are three 512-bit channels contending for the same DDR controller. That's ~19 GB/s per channel, which saturates the available bandwidth (you have almost 60 GB/s of traffic going to a DDR controller with a theoretical max bandwidth of ~27 GB/s).
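(Where the per-channel figure comes from, assuming a ~300 MHz kernel clock: a 512-bit beat is 64 bytes, and 64 bytes × 300 MHz ≈ 19.2 GB/s; two read channels plus one write channel then demand ~57.6 GB/s from the one controller.)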

So unfortunately it's fundamentally not going to get much faster. That was the purpose of the exercise: to point out that even though the fabric can keep up with this at line rate, the memory controller can't, so we hit a speed limit.

boxerab commented 3 years ago

Thanks! In my case I am passing two 8M-float buffers into the kernel and transferring one 8M-float buffer back, so that should be a total of 64 MB in and 32 MB out.

rwarmstr commented 3 years ago

It doesn't really matter, though; you're still transferring both input buffers at the same time over PCIe, the kernel is going to try to read them both at the same time, and so on. The instantaneous memory pressure is the same; it's just sustained over a very short duration.

boxerab commented 3 years ago

Thanks, that makes sense. So, what is the best practice for doing high-memory-bandwidth compute on the U250?
Would interleaving the lines from the two buffers into a single buffer help? Or using one controller per buffer? The problem with one controller per buffer is that I won't be able to use concurrent kernels.

rwarmstr commented 3 years ago

The best thing would be to split the traffic across multiple channels or memory banks, but that can sometimes be difficult if you have to spread things out too much on the die. This is where HBM devices like the U50 excel, since you're working with individual chiplets instead of a single combined channel. The same core is over 2x faster on HBM without any further optimization.
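One concrete way to do that splitting on the U250, sketched here with hypothetical kernel-instance and port names, is a v++ linker config that pins each kernel port to its own DDR bank:

# hypothetical vadd.cfg, passed as: v++ --link --config vadd.cfg ...
[connectivity]
sp=wide_vadd_1.in1:DDR[0]
sp=wide_vadd_1.in2:DDR[1]
sp=wide_vadd_1.out:DDR[2]

With the ports on separate banks, the two reads and the write no longer contend for one controller, at the cost of tying up three of the U250's four DDR banks for a single kernel instance.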

In general, if you have DDR DIMMs, they don't perform as well for applications that are I/O bound; it makes more sense to look for compute-bound workloads, or algorithms that can leverage streaming to stay in the FPGA fabric, like some of the vision examples. You can stack up operations "indefinitely" and they'll all complete in approximately the same block of time.
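A minimal sketch of that streaming idea (the stage names and placeholder operation are made up): chain stages through hls::stream FIFOs inside a DATAFLOW region so intermediate results never touch DDR, and each extra stage adds compute without adding memory traffic.

#include <ap_int.h>
#include <hls_stream.h>

typedef ap_uint<512> uint512_dt;

static void read_in(const uint512_dt *in, hls::stream<uint512_dt> &s, int n) {
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE
        s.write(in[i]); // the only global-memory reads
    }
}

static void stage(hls::stream<uint512_dt> &in, hls::stream<uint512_dt> &out, int n) {
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE
        out.write(in.read() ^ uint512_dt(1)); // placeholder per-word operation
    }
}

static void write_out(hls::stream<uint512_dt> &s, uint512_dt *out, int n) {
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE
        out[i] = s.read(); // the only global-memory writes
    }
}

void pipeline_top(const uint512_dt *in, uint512_dt *out, int n) {
#pragma HLS DATAFLOW
    hls::stream<uint512_dt> s0("s0"), s1("s1");
#pragma HLS STREAM variable=s0 depth=4
#pragma HLS STREAM variable=s1 depth=4
    read_in(in, s0, n);
    stage(s0, s1, n); // more stages can be chained here without extra DDR traffic
    write_out(s1, out, n);
}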

boxerab commented 3 years ago

Thanks, I do have access to a U50, so I will give it a try. But I am still a bit puzzled - there are encoders implemented in the fabric on Xilinx chips, and they are able to encode 4K video at 30p or 60p.

For example:

https://www.intopix.com/JPEG2000-uhd-4k-8k-encoder-decoder

Is the U250 suitable for encoding at that data rate?

rwarmstr commented 3 years ago

Absolutely; keep in mind that this is (on purpose) a bad example of how to use the device, because the application is I/O bound. Video decompression (or compression) does not have this problem. Also, at 60 FPS you have ~16 ms per frame, which is generally about an 8 MB data transfer. For reference, the example I linked that runs in ~23 ms on the U250 is processing 256 MB of data per channel, so it's reading 512 MB and writing 256 MB back to DDR in that amount of time.

If it were not I/O bound, the compute would take ~7 ms at 16 words per clock to handle this amount of data.

Example 5 is better because it shows overlapping of compute and transfer instead of doing everything linearly, but you still generally want to focus either on pipelining operations (stacking things up without going to external memory) or on algorithms that are compute-bound instead of I/O bound.

boxerab commented 3 years ago

Thanks, that is good to know! I will take a look at Example 5.