Xilinx / Vitis_Libraries

Vitis Libraries
https://docs.xilinx.com/r/en-US/Vitis_Libraries
Apache License 2.0

L1 2-Dimensional FFT Module Stalling in C/RTL Cosim and on FPGA Hardware #157

Open mengstro opened 1 year ago

mengstro commented 1 year ago

Hi everyone,

I'm trying to use the 2D FFT in some code I'm writing, but the RTL simulation runs indefinitely after the C-sim finishes (the C-sim looks correct and does not experience any issues). I've attempted to take the RTL files over to Vivado to run it as an IP, but it still halts, even on real FPGA hardware. I'm not sure if the HW deadlock detection works 100% of the time, but I have tested the implementation with deadlock detection enabled, and it never reports a deadlock error.

When I take out the 2D FFT and replace it with a pass-through (output equals the input), it runs fine - no halting whatsoever. The issue appears to lie within the 2D FFT module, but it's not clear where it occurs.

I do also want to mention that a fair number of the pragmas in the source code for the 2D FFT are outdated/deprecated (like the RESOURCE and DATA_PACK pragmas, for instance). I've made some substitutions to use the equivalent recommended pragmas, like BIND_STORAGE in place of RESOURCE, in order to ensure the code works as intended.
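For readers making the same kind of substitution, here is a minimal sketch of what such a migration might look like. The struct `Sample`, the function `sum_real_buffered`, the buffer name, and the chosen storage type are all hypothetical and are not taken from the library source. Using AGGREGATE in place of DATA_PACK is an assumption based on the Vitis HLS migration guidance, and the two pragmas may not behave identically (padding and alignment can differ), so check the pragma reference for your tool version. A regular C++ compiler ignores the pragmas, so the function still runs in C simulation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical struct standing in for a packed complex sample.
struct Sample {
    short re;
    short im;
};

constexpr std::size_t kDepth = 32;

// Hypothetical function: copies samples into a local buffer, then sums the real parts.
int sum_real_buffered(const Sample* in, std::size_t n) {
    assert(n <= kDepth);  // the local buffer holds at most kDepth samples

    Sample buf[kDepth];
    // Older style (deprecated in Vitis HLS):
    //   #pragma HLS RESOURCE variable=buf core=RAM_2P_BRAM
    //   #pragma HLS DATA_PACK variable=buf
    // Assumed replacements:
#pragma HLS BIND_STORAGE variable=buf type=ram_2p impl=bram
#pragma HLS AGGREGATE variable=buf

    for (std::size_t i = 0; i < n; ++i) {
        buf[i] = in[i];
    }

    int acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += buf[i].re;
    }
    return acc;
}
```

Calling `sum_real_buffered` on three samples with real parts 1, 2 and 3 returns 6 in plain C++; the pragmas only matter once the function goes through synthesis.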

If I could get some assistance with this, I'd greatly appreciate it - if screenshots and project files are needed, just let me know!

P.S. I am running Vitis HLS 2022.1 on a 64-bit computer running Ubuntu 20.04.5 LTS (RAM is 64GB)

mengstro commented 1 year ago

Hi again everyone,

So after doing some digging, this stalling issue seems to come from functions running out of order in the FFT kernels. I thought it was a FIFO depth issue, because changing the FIFO sizes caused the module to run a little longer before stalling, but this was just a symptom I was seeing, not the root cause.

Using Vitis HLS's C/RTL cosim "Wave Viewer" I saw that within the fftStage function (can be found in the "hls_ssr_fft.hpp" file), the streamingDataCommutor function was running before the fftStageKernelS2S function - I apologize if it's hard to see, but here's a screenshot: Screenshot from 2022-11-21 14-17-54

I got a little sidetracked thinking that this was the DATAFLOW pragma's fault, but I now believe that this is due to the non-blocking reads that occur within streamingDataCommutor in the "hls_ssr_fft_streaming_data_commutor.hpp" file. At the time of writing, I am prepping another cosim to see if my changes solve the problem - I'm planning on posting more as I find things out.

All the best, Matt

mengstro commented 1 year ago

Okay, so the issue does appear to be within the streamingDataCommutor functions, but I'm experiencing pipelining issues with my "fix". Before the first for-loop in each function (except the "no stall" one), I simply put a while-loop that prevents any FIFO reads before data is available by checking the FIFO's empty() member function. With this while-loop included, Vitis HLS reports that it can't properly pipeline the for-loop that follows it, which incurs a large latency penalty.
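To make the difference concrete, here is a minimal software sketch of the two patterns described above. It is not the library's code: `MockStream` is a hypothetical stand-in that only mimics the `empty()`, `read()`, `read_nb()` and `write()` methods of `hls::stream`, and `consume_unguarded`, `consume_guarded` and the `max_spins` bound are made up for illustration (the bound exists only so the single-threaded model cannot hang).

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical software stand-in for hls::stream<T>; only the methods used here.
template <typename T>
class MockStream {
public:
    bool empty() const { return q_.empty(); }
    void write(const T& v) { q_.push(v); }
    T read() {
        assert(!q_.empty());  // real hardware would block here instead
        T v = q_.front();
        q_.pop();
        return v;
    }
    // Non-blocking read: returns false and leaves `out` untouched when empty.
    bool read_nb(T& out) {
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }

private:
    std::queue<T> q_;
};

// Pattern 1: fixed trip count, return value of read_nb ignored.
// Every iteration emits a value, whether or not the read succeeded.
std::vector<int> consume_unguarded(MockStream<int>& in, std::size_t n) {
    std::vector<int> out;
    int sample = 0;
    for (std::size_t i = 0; i < n; ++i) {
        in.read_nb(sample);     // may fail silently
        out.push_back(sample);  // on failure the stale value is emitted again
    }
    return out;
}

// Pattern 2: the guard from the comment above, a wait loop placed before
// the for-loop so that no read is attempted until data has arrived.
std::vector<int> consume_guarded(MockStream<int>& in, std::size_t n,
                                 std::size_t max_spins) {
    std::size_t spins = 0;
    while (in.empty() && spins < max_spins) {
        ++spins;  // in hardware this would simply wait for the producer
    }
    if (in.empty()) return {};  // producer never delivered anything
    return consume_unguarded(in, n);
}
```

With two values (7, 8) in the stream and a trip count of 4, `consume_unguarded` yields 7, 8, 8, 8: the last two outputs are stale. The guarded variant refuses to start on an empty stream, but the guard only covers the first read, which matches the fix as described rather than a complete solution.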

According to the performance metrics, I should be seeing a latency of ~900 cycles for a 32x32 input, but with the latency penalty it goes up by nearly an order of magnitude (~7500 cycles).

vt-lib-support commented 1 year ago

Hi, as we understand from your description, you are trying to use the 2D FFT L1 IP to process multiple frames of input. We double-checked our internal regression and the L1 case, unmodified, and cannot see a hang.

We have the following suggestion: for integrating an HLS module and letting it process continuously, the FIFO interface would probably be easier. Please kindly refer to the L2 2D kernel's way of using the API: https://github.com/Xilinx/Vitis_Libraries/blob/33c3869f77b437d0bfcd196e8be0c7ad901af7db/dsp/L2/include/hw/vitis_2dfft/float/vitis_fft/fft_kernel.hpp#L114-L119 You may want to wrap these lines into a function, use hls::stream (FIFO) as the port type, set it as top, and let HLS produce an IP from it.
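As a rough illustration of that suggestion, the sketch below shows the general shape of a top-level function with stream ports. It is a sketch under stated assumptions, not the L2 kernel's code: `MockStream` is a hypothetical software stand-in for `hls::stream`, `transform_placeholder` is a pass-through standing in for the library's 2D FFT call (whose actual signature is not shown in this thread), and `top_fft2d_wrapper`, the 32x32 size and the `ap_fifo` interface mode are illustrative choices. In a real HLS project you would include `hls_stream.h` and use `hls::stream` directly.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <queue>

// Hypothetical software stand-in for hls::stream<T>; only the methods used here.
template <typename T>
class MockStream {
public:
    bool empty() const { return q_.empty(); }
    void write(const T& v) { q_.push(v); }
    T read() {
        assert(!q_.empty());  // real hardware would block here instead
        T v = q_.front();
        q_.pop();
        return v;
    }

private:
    std::queue<T> q_;
};

using Cplx = std::complex<float>;
constexpr std::size_t kRows = 32;
constexpr std::size_t kCols = 32;

// Placeholder for the library's 2D FFT call: forwards one frame unchanged.
void transform_placeholder(MockStream<Cplx>& in, MockStream<Cplx>& out) {
    for (std::size_t i = 0; i < kRows * kCols; ++i) {
        out.write(in.read());
    }
}

// Hypothetical top function: stream ports in, stream ports out.
void top_fft2d_wrapper(MockStream<Cplx>& in, MockStream<Cplx>& out) {
    // Assumed interface choice; a plain C++ compiler ignores these pragmas.
#pragma HLS INTERFACE mode=ap_fifo port=in
#pragma HLS INTERFACE mode=ap_fifo port=out
    transform_placeholder(in, out);
}
```

Feeding one 32x32 frame into `in` and calling `top_fft2d_wrapper` leaves the same 1024 samples, in order, on `out`; swapping the placeholder for the real transform is the part that depends on the library version.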

mengstro commented 1 year ago

Hi @vt-lib-support,

Thanks for getting back! It appears that I am already using it the way you are suggesting, but I'm still experiencing the same issues - I'd be happy to go through this in a little more detail if needed (we've been experiencing issues with this for a long while, but it wasn't until recently that I figured anything out).

I have a few requests, if you don't mind me asking, to see where things are going wrong:

  1. What were/are the specs for the computer this test was run on?
  2. What version of Vitis/Vivado was this test run on?
  3. Were these tests done using C Simulation, C/RTL Cosimulation, and/or in an actual hardware implementation? How old are the results?
  4. Is there a timing diagram/Wave Viewer output available for all the functions inside the 2D FFT module?

I apologize for the long laundry list of items, but it's just not clear why it's not working "out of the box".

All the best, Matt

vt-lib-support commented 1 year ago

Hi Matt,

You may want to check the Vitis doc, which explains how you can get an HLS co-sim project from our L2 Vitis case. Then you can compare it with your waveform.

Regarding regression, the libraries are always tested against the corresponding tool release, i.e. the main branch against the latest official release (2022.2 as of now), the 2022.1 branch against 2022.1 Vitis/HLS/Vivado, and so on. Generally, just follow the corresponding tool's recommendations for the computer environment. All APIs have passed cosim/hw-emu (essentially, cycle-accurate RTL simulation) before release, unless listed as a known issue in the release notes.

mengstro commented 1 year ago

Hi @vt-lib-support,

If I'm understanding this correctly, the document you're referring me to leverages Vitis HLS to produce the kernel for use in Vitis IDE as an accelerated kernel. This makes sense for the L2 kernel example, but I'm trying to use the L1 module outside of Vitis IDE.

For my use case, I plan on using the 2D FFT primitive inside another function that does a 2D convolution using FFTs, and I was looking to produce an IP to use in a Vivado block diagram. Is it possible to obtain the same II/latency?

Understandably, there may be some differences between the final implementations of the 2D FFT in Vitis IDE as an accelerated kernel and in Vivado as an IP, but it's not clear why there is such a significant difference in performance.

All the best, Matt

mengstro commented 1 year ago

Hi @vt-lib-support,

So I took a look at the 2D FFT SNR test code in the L1 2D FFT example directory (fixed point, HW), and I ran the 16x16 test case - the timing results were... interesting. Resource utilization looked good, but the latency figures were on the order of a hundred thousand clock cycles, and on top of that, the output of the RTL simulation was inaccurate (SNR for C-sim was 82, the RTL-sim's was ~7.5). Here's a screenshot of the timeline trace I got back after running the makefile:

Screenshot from 2022-12-15 15-55-13

I'm still waiting for the L2 test case to finish since the HW EMU is taking a long time, which I anticipated. Again, not sure where these issues are coming from - I'd be happy to work with someone to resolve this.

In the meantime, if there are any clear and thorough instructions I could follow to reproduce the numbers in the online documentation, that would help me immensely. I can't seem to find such instructions on the Xilinx website (I even have access to a handful of courses and training materials), so it could very well be something that I am doing wrong and I just don't know it.

Thank you for your time, Matt

vt-lib-support commented 1 year ago

Hi Matt,

We checked the DSP PL 2D FFT API and the corresponding unit test, and the test passes Cosim successfully with the 2022.2 release of Vitis HLS. Since you packed the 2D FFT primitive as a Vivado IP and integrated it into your kernel, the stall issue may be in the kernel. As the pure API (2D FFT) passes Cosim, could you double-check your kernel design?

mengstro commented 1 year ago

Hi @vt-lib-support,

The code provided as-is under the Vitis DSP Libraries can be synthesized and run successfully in Cosim, but it does not perform as well as it ought to. When you ran the corresponding unit tests, did you see what the latency/II metrics were?

I have been speaking over email with someone who works at Xilinx about a related matter, and I thought it'd be a good idea to paraphrase the material here.

So I did rerun my tests using the code found here: https://github.com/Xilinx/Vitis_Libraries/tree/main/dsp/L1/tests/hw/2dfft/fixed/impulse_test/complex_impulse

I had to make a change to the top_2d_fft_test.hpp file to switch the FFT size from 16 to 32, but this was the only change I made. The command I ran in a terminal within the build directory was: make run XPART='xcvc1902-vsva2197-2MP-e-S' CSIM=1 CSYNTH=1 COSIM=1

As a sanity check, here are the test parameters the program returns during C sim:

================================================================================
---------------------Calling 2D FFT Kernel with Parameters----------------------
================================================================================
    The Main Memory Width (no. complex<float>)   : 8
    The Size of 1D Row Kernel                    : 32
    The SSR for 1D Row Kernel                    : 4
    The Transform Direction for Row Kernel       : Forward
    The Size of 1D Column Kernel                 : 32
    The SSR for 1D Row Kernel                    : 4
    The Transform Direction for Row Kernel       : Forward
    The Row Instance ID Offset                   : 40000
    The Column Instance ID Offset                : 80000
    Number of 1D Kernels Used Row/Col wise       : 2
    The Total Number of 1D Kernels Used(row+col) : 4
================================================================================

When the C sim and Cosim finished running, I opened up the HLS project in Vitis HLS. Here's what the timeline trace came back with: image

According to the 2020.2 version of the user guide found online, I should have seen latency and II figures of 875 and 227, respectively, but the timeline shows that the execution time for one 2D FFT kernel operation was nearly 6,000 cycles.

The point is that when it does work, it does not perform optimally (i.e. it is very slow), and when the deprecated pragmas are replaced with ones that serve the same purpose, it refuses to work without further modifications. The second point regarding my changes is a minor one - the main issue is the failure to meet the documented latency/II figures when it does work.

All the best, Matt

sandsbl commented 1 year ago

I wanted to quickly summarize where we are at with this issue:

vt-lib-support commented 1 year ago

Sorry for the late reply; summary as follows: