Closed — zhuofanzhang closed this issue 3 years ago
Hi @zhuofanzhang, This example is mainly intended to explain the benefits of array partitioning, which is highlighted using the estimated latency reported by HLS. Measuring the same benefit from the software host would not be meaningful, because this kernel runs for a very short time (a single 16x16 matrix multiplication). When communicating with the kernel from the host, there are other overheads such as configuring the kernel, checking its status, etc. We are going to modify the kernel in such a way that the benefit of partitioning can be demonstrated from the host code as well. As for your question about increasing kernel utilization, you need to consider a few things: a real-world kernel should do as much computation as possible in a single call, so that the host-kernel communication overhead becomes negligible by comparison.
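For reference, a minimal sketch of the array-partitioning idea being discussed (function and variable names are illustrative, not the repository's exact kernel; the `#pragma HLS` directives only take effect under Vitis HLS and are ignored by a regular compiler):

```cpp
#include <cassert>

constexpr int N = 16;

// Copy B into a local buffer and partition it completely along dim=1,
// so the N reads of the inner dot-product loop can all happen in the
// same cycle instead of being serialized through a single BRAM port.
void matmul_partitioned(const int A[N][N], const int B[N][N], int C[N][N]) {
    int localB[N][N];
#pragma HLS ARRAY_PARTITION variable=localB dim=1 complete
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            localB[i][j] = B[i][j];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE II=1
            int acc = 0;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * localB[k][j];
            C[i][j] = acc;
        }
    }
}
```

Without the partition, the pipelined inner loop is limited by the two read ports of the BRAM holding `localB`; with it, HLS can schedule all N multiplies per cycle, which is where the roughly 9.5 us vs 3.5 us latency difference comes from.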
Hi @heeran-xilinx, Thanks for your quick reply. I am still a little confused about the overheads. Besides the communication overhead between host and kernel, is there any other overhead while the CU is executing? Below is the application timeline of the hardware execution of the matmul kernel. The kernel has a very simple structure, i.e., readA, readB, compute matmul, writeC. So I would expect the CU execution to finish at t1; however, it does not finish until t2. And (t1-t0) matches the overall kernel latency estimated by HLS exactly. I am wondering what the CU is doing between t1 and t2?
Hi @zhuofanzhang, This is a very valid point, and I am not exactly sure why there is a gap between the final write (t1) and the actual end of the compute unit (t2). This looks to me like compute-unit termination cycles, but in my opinion it should not be this large. If possible, could you please post this query to the Xilinx Forum dedicated to the Vitis tool flow, here, to get a quick response from the many Vitis experts? https://forums.xilinx.com/t5/Vitis-Acceleration-SDAccel-SDSoC/bd-p/tools_v After posting, please paste the link here for reference.
-Heera
Hi @heeran-xilinx, Thanks for your advice. I have posted the query to the Xilinx Forum: https://forums.xilinx.com/t5/Vitis-Acceleration-SDAccel-SDSoC/Question-about-the-CU-execution-time/td-p/1264145
Closing this issue, as the discussion on this topic has started in the Xilinx forum.
Hi, I have a question about the CU utilization of the array partition (cpp_kernels) example. If the overall average latencies of the normal kernel and the partitioned kernel are 9.522 us and 3.540 us respectively, why are the wall-clock times of the hardware execution 396685 ns and 256367 ns respectively? In other words, how can I achieve higher CU utilization?