Closed wonkyoc closed 2 years ago
Hi @wonkyoc, I ran this testcase and observed the same thing you have. The first wait (`xrt::run[0].wait`) shows a long latency because during that time all the CUs are running in parallel. As you have seen, this design has 4 CUs and the host code launches all 4 in parallel. So by the time the first CU finishes, which is what the long `xrt::run[0].wait` duration reflects, CU2, CU3, and CU4 have also finished, or almost finished. The remaining wait calls therefore return very quickly.
You can verify this by enabling hardware trace, which will show all the CUs running in parallel.
Compile the kernel again, but this time add a switch in `vadd.cfg` to enable hardware trace:
```ini
[connectivity]
nk=vadd:4

[profile]
data=all:all:all
```
Compile the kernel and generate a new XCLBIN.
Then add the following switches in `xrt.ini` to collect hardware trace:
```ini
[Debug]
native_xrt_trace=true
device_trace=fine
```
Now in the timeline you will see all 4 CUs running in parallel and finishing at almost the same time. This is why `xrt::run[0].wait` shows a long latency while the rest of the waits are quick.
Thanks! I was expecting this result, but I didn't know how to measure the actual kernel processing time. I confirmed the exact same behavior on my device. I am closing this issue.
I am analyzing one of the basic examples and I noticed a long latency for `xrt::run::wait`. I would like to know why this function has such a huge latency despite the short running time of the actual kernel. Here are the numbers for the example on a U25; they are shown in the figure below.

Given the execution time of [0] `xrt::run::run` + `run::start`, `run[i].wait()` should already have finished, but it still waits for some reason. Could this be a bug?
Environment