Xilinx / Vitis-Tutorials

Vitis In-Depth Tutorials
https://Xilinx.github.io/Vitis-Tutorials/

Hardware emulation of Hardware_Acceleration/Feature_Tutorials/01-rtl_kernel_workflow runs forever #302

Closed dguenzel closed 2 years ago

dguenzel commented 2 years ago

Hi,

I am following this tutorial: https://github.com/Xilinx/Vitis-Tutorials/tree/2022.1/Hardware_Acceleration/Feature_Tutorials/01-rtl_kernel_workflow

When I launch hardware emulation it runs for a very long time until the maximum number of loop iterations is reached. Then the error "Value read back does not match reference" is thrown.

I believe there is an error in lines 60 & 61 of the host code: https://github.com/Xilinx/Vitis-Tutorials/blob/6410087e99cce0ff23807c28be7ffd60ce04a09a/Hardware_Acceleration/Feature_Tutorials/01-rtl_kernel_workflow/reference-files/src/host/user-host.cpp#L60-L61

If I set the memory bank indexes to 0 (instead of 1) emulation finishes in two iterations:

argc = 2
argv[0] = /mnt/workspace/ems4/Vitis-Tutorials/Hardware_Acceleration/Feature_Tutorials/01-rtl_kernel_workflow/reference-files/work1/kernelTest/Emulation-HW/kernelTest
argv[1] = /mnt/workspace/ems4/Vitis-Tutorials/Hardware_Acceleration/Feature_Tutorials/01-rtl_kernel_workflow/reference-files/work1/kernelTest_system/Emulation-HW/binary_container_1.xclbin
Open the device 0
Load the xclbin /mnt/workspace/ems4/Vitis-Tutorials/Hardware_Acceleration/Feature_Tutorials/01-rtl_kernel_workflow/reference-files/work1/kernelTest_system/Emulation-HW/binary_container_1.xclbin
INFO: [HW-EMU 07-0] Please refer the path "/mnt/workspace/ems4/Vitis-Tutorials/Hardware_Acceleration/Feature_Tutorials/01-rtl_kernel_workflow/reference-files/work1/kernelTest/Emulation-HW/.run/26305/hw_em/device0/binary_0/behav_waveform/xsim/simulate.log" for more detailed simulation infos, errors and warnings.
INFO: [HW-EMU 01] Hardware emulation runs simulation underneath. Using a large data set will result in long simulation times. It is recommended that a small dataset is used for faster execution. The flow uses approximate models for Global memories and interconnect and hence the performance data generated is approximate.
configuring embedded scheduler mode
scheduler config ert(1), dataflow(0), slots(16), cudma(0), cuisr(0), cdma(0), cus(1)
Allocate Buffer in Global Memory
loaded the data
synchronize input buffer data to device global memory
INFO: Setting IP Data
Setting Register "A" (Input Address)
Setting Register "B" (Input Address)
INFO: IP Start
Read Loop iteration: 1 and Axi Control = 6
Read Loop iteration: 2 and Axi Control = 4
INFO: IP Done
Get the output data from the device
TEST PASSED
INFO::[ Vitis-EM 22 ] [Time elapsed: 0 minute(s) 26 seconds, Emulation time: 0.124025 ms]
Data transfer between kernel(s) and global memory(s)
Vadd_A_B_1:m00_axi-HBM[0]          RD = 16.000 KB              WR = 0.000 KB        
Vadd_A_B_1:m01_axi-HBM[0]          RD = 16.000 KB              WR = 16.000 KB       

INFO: [HW-EMU 06-0] Waiting for the simulator process to exit
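The "Axi Control" values printed in the read loop are easier to follow when read as bit flags. The sketch below assumes the kernel exposes the standard ap_ctrl_hs control register (bit 0 = start, bit 1 = done, bit 2 = idle, bit 3 = ready); `decode_axi_control` is a hypothetical helper and is not part of the tutorial's host code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Bit masks of the control register, assuming the ap_ctrl_hs layout.
constexpr std::uint32_t AP_START = 1u << 0;
constexpr std::uint32_t AP_DONE  = 1u << 1;
constexpr std::uint32_t AP_IDLE  = 1u << 2;
constexpr std::uint32_t AP_READY = 1u << 3;

// Hypothetical helper: turn a raw control value into a readable flag list.
std::string decode_axi_control(std::uint32_t value) {
    std::string out;
    auto append = [&](std::uint32_t mask, const char* name) {
        if (value & mask) {
            if (!out.empty()) out += '|';
            out += name;
        }
    };
    append(AP_START, "start");
    append(AP_DONE, "done");
    append(AP_IDLE, "idle");
    append(AP_READY, "ready");
    return out.empty() ? "none" : out;
}
```

Under that assumption, the value 6 in iteration 1 reads as done plus idle and the value 4 in iteration 2 reads as idle only, which is consistent with the loop ending after the second read. A value that stays at 1 would mean the start bit is set and the kernel has not yet reported done.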

Can you please confirm the error and the fix, so that I can be sure it's working correctly?

Thank you very much.

randyh62 commented 2 years ago

I guess it depends on what you are seeing in your build. When I run the build I see the following messages in the compilation transcript:

INFO: [CFGEN 83-0] Kernel Specs: 
INFO: [CFGEN 83-0]   kernel: Vadd_A_B, num: 1  {Vadd_A_B_1}
INFO: [CFGEN 83-2226] Inferring mapping for argument Vadd_A_B_1.A to DDR[1]
INFO: [CFGEN 83-2226] Inferring mapping for argument Vadd_A_B_1.B to DDR[1]
INFO: [SYSTEM_LINK 82-37] [16:15:00] cfgen finished successfully

This seems to indicate the need for the 1 in the buffer creation (xrt::bo()) rather than 0. When I run the hardware emulation I get the following results:

Allocate Buffer in Global Memory
loaded the data
synchronize input buffer data to device global memory
INFO: Setting IP Data
Setting Register "A" (Input Address)
Setting Register "B" (Input Address)
INFO: IP Start
Read Loop iteration: 1 and Axi Control = 1
Read Loop iteration: 2 and Axi Control = 1
Read Loop iteration: 3 and Axi Control = 6
Read Loop iteration: 4 and Axi Control = 4
INFO: IP Done
Get the output data from the device
TEST PASSED

The XRT native API documentation provides the following explanation of this argument: "Below is an example of creating two buffers. Note the last argument of xrt::bo is the enumerated index of the memory bank as seen by the XRT (in this example index 8 corresponds to the host-memory bank). The bank index can be obtained by xbutil examine --report memory command."

You can also get the memory assignment directly from the xclbin file: xclbinutil -i vadd.xclbin --info

What does your xclbin indicate?

dguenzel commented 2 years ago

You are right, I hadn't considered that the memory mapping might change depending on the target hardware. The tutorial states that it is for the Alveo u200 (DDR4), but I actually built for the u55c (HBM).

My build log shows this:

INFO: [CFGEN 83-0] Kernel Specs: 
INFO: [CFGEN 83-0]   kernel: Vadd_A_B, num: 1  {Vadd_A_B_1}
INFO: [CFGEN 83-2226] Inferring mapping for argument Vadd_A_B_1.A to HBM[0]
INFO: [CFGEN 83-2226] Inferring mapping for argument Vadd_A_B_1.B to HBM[0]
INFO: [SYSTEM_LINK 82-37] [10:34:05] cfgen finished successfully

This matches the information in the xclbin. Thank you for the clarification!
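As a rough illustration of reading the mapping out of a build log, here is a minimal sketch that extracts the argument name and memory bank from a cfgen "Inferring mapping" line. `BankMapping` and `parse_mapping` are hypothetical names, and the pattern only covers the message format shown in this thread. In this thread the bracketed number happened to match the index passed to xrt::bo, but the xclbin (or xbutil) remains the authoritative source for the bank index.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Result of parsing one cfgen "Inferring mapping" line.
struct BankMapping {
    std::string argument;  // e.g. "Vadd_A_B_1.A"
    std::string memory;    // e.g. "HBM" or "DDR"
    int index;             // bracketed number, e.g. 0 or 1
};

// Hypothetical helper: extract the argument and memory bank from a log line.
// Returns std::nullopt when the line is not a mapping message.
std::optional<BankMapping> parse_mapping(const std::string& line) {
    static const std::regex pattern(
        R"(Inferring mapping for argument (\S+) to ([A-Za-z0-9_]+)\[(\d+)\])");
    std::smatch match;
    if (!std::regex_search(line, match, pattern)) {
        return std::nullopt;
    }
    return BankMapping{match[1].str(), match[2].str(),
                       std::stoi(match[3].str())};
}
```

Fed the two logs from this thread, the helper would report DDR with index 1 for the u200 build and HBM with index 0 for the u55c build, which is the difference that caused the hang.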

BananaTaiga commented 1 year ago

Hello @dguenzel! I'm currently working through this tutorial and have exactly the same issue. May I ask how exactly I can change the memory mapping to fix it? I'm also using a u55c card.

dguenzel commented 1 year ago

Hello, the solution is already in the first post. It is simply a matter of giving the correct index for the HBM channel when the buffer objects are created:

auto ip1_boA = xrt::bo(device, vector_size_bytes, 0); 
auto ip1_boB = xrt::bo(device, vector_size_bytes, 0);

The last argument specifies the index of the memory bank where the buffer will be allocated. You can find these indexes as @randyh62 described, but 0 should work for u55c.

BananaTaiga commented 1 year ago

@dguenzel Now I understand. At first I thought I had made a wrong mapping during XO/IP packaging, and I didn't believe the actual solution required changing only one parameter in the host program. Now it works perfectly in both HW emulation and the actual HW run. Thank you!