Closed — zhuofanzhang closed this issue 3 years ago
Hi @zhuofanzhang, This example is mainly intended to explain the benefits of array partitioning, which is highlighted using the estimated latency reported by HLS. Measuring the same benefit from the software host would not be meaningful, because this kernel runs for a very short time (a single 16x16 matrix multiplication). When communicating with the kernel from the host, there are other overheads such as configuring the kernel, checking its status, etc. We are going to modify the kernel in such a way that the benefit of partitioning can be demonstrated from the host code as well. As for your question about increasing kernel utilization, you need to consider a few things: a real-world kernel should do as much computation as possible in a single call, so that the host-kernel communication overhead becomes negligible by comparison.
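For reference, a minimal sketch of the array-partitioning idea being discussed (function and variable names are illustrative, not the repository's exact kernel; the `#pragma HLS` directives only take effect under Vitis HLS and are ignored by a regular compiler):

```cpp
#include <cassert>

constexpr int N = 16;

// Copy B into a local buffer and partition it completely along dim=1,
// so the N reads of the inner dot-product loop can all happen in the
// same cycle instead of being serialized through a single BRAM port.
void matmul_partitioned(const int A[N][N], const int B[N][N], int C[N][N]) {
    int localB[N][N];
#pragma HLS ARRAY_PARTITION variable=localB dim=1 complete
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            localB[i][j] = B[i][j];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE II=1
            int acc = 0;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * localB[k][j];
            C[i][j] = acc;
        }
    }
}
```

Without the partition, the pipelined inner loop is limited by the two read ports of the BRAM holding `localB`; with it, HLS can schedule all N multiplies per cycle, which is where the roughly 9.5 us vs 3.5 us latency difference comes from.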
Hi @heeran-xilinx, Thanks for your quick reply. I am still a little confused about the overheads. Besides the communication overhead between host and kernel, is there any other overhead while the CU is executing? Below is the application timeline of the hardware execution of the matmul kernel. The kernel has a very simple structure, i.e., readA, readB, compute matmul, writeC. So I would expect the CU execution to finish at t1; however, it does not finish until t2. And (t1-t0) matches the overall kernel latency estimated by HLS exactly. I am wondering what the CU is doing between t1 and t2?
Hi @zhuofanzhang, This is a very valid point, and I am not exactly sure why there is a gap between the final write (t1) and the actual end of the compute unit (t2). This looks to me like compute-unit termination cycles, but in my opinion it should not be this large. If possible, could you please post this query to the Xilinx Forum dedicated to the Vitis tool flow, here, to get a quick response from the many Vitis experts? https://forums.xilinx.com/t5/Vitis-Acceleration-SDAccel-SDSoC/bd-p/tools_v After posting, please paste the link here for reference.
-Heera
Hi @heeran-xilinx, Thanks for your advice. I have posted the query to the Xilinx Forum: https://forums.xilinx.com/t5/Vitis-Acceleration-SDAccel-SDSoC/Question-about-the-CU-execution-time/td-p/1264145
Closing this issue, as the discussion on this topic has started in the Xilinx forum.
Hi, I have a question about the CU utilization of the array partition (cpp_kernels) example. If the overall average latencies of the normal kernel and the partitioned kernel are 9.522 us and 3.540 us respectively, why are the wall-clock times of the hardware execution 396685 ns and 256367 ns respectively? In other words, how can I achieve higher CU utilization?