Xilinx / XRT

Run Time for AIE and FPGA based platforms
https://xilinx.github.io/XRT
Other
549 stars 464 forks source link

xrt::run:wait long latency #6601

Closed wonkyoc closed 2 years ago

wonkyoc commented 2 years ago

I am analyzing one of the basic examples and I noticed a long latency for xrt::run:wait. I would like to know why this function has such as huge latency despite a short running time of an actual kernel.

Here is the number for the example by U25:

# Host application
for (i = 0;  i < 4; i++)
    run[i] = kernel()
for (i = 0;  i < 4; i++)
    run[i].wait()

# Execution time
[0] xrt::run::run + run::start = 56 us
[1] xrt::run::run + run::start = 28 us
[2] xrt::run::run + run::start = 16 us
[3] xrt::run::run + run::start = 17 us

[0] xrt::run::wait = 1,224 us (~= 1.2 ms)
[1] xrt::run::wait = 55 us
[2] xrt::run::wait = 0.4 us
[3] xrt::run::wait = 0.4 us

The number is described in the below figure. image

Given the execution time of [0] xrt::run::run + run::start, run[i].wait() should already be finished but it still waits for some reason. Would this be a possible bug?

Environment

uday610 commented 2 years ago

Hi @wonkyoc , I run this testcase and see the same as you have also observed. The reason the first wait (xrt::run[0].wait) showing long latency because during that time all the CUs are running in parallel. As you have seen this design has 4CU and the host code is launching 4 CUs in parallel. So when the first CU is finished, that is reflected by the long xrt::run[0].wait duration, during that time CU2, CU3 and CU4 are also finished, or almost finished. So rest of the wait functions are very short.

You can see it by enabling hardware trace that can show you all CUs are running in parallel.

Compile the kernel again, but this time add a swicth in vadd.cfg to enable hardware trace

[connectivity]
nk=vadd:4
[profile]
data=all:all:all

Compile the kernel and generate new XCLBIN.

Add the following switch in xrt.ini to collect hardware trace

[Debug]
native_xrt_trace=true
device_trace=fine

Now in timeline you will see all 4 CUs are running in parallel and finishing almost same time. This is the reason xrt::run[0].wait showing long latency, and rest of the waits are quick.

trace

wonkyoc commented 2 years ago

Thanks! I was expecting this result but I didn't know how to measure the actual kernel processing. I confirmed the exact same operation for my device. I am closing this issue.