Xilinx / Vitis-AI

Vitis AI is Xilinx’s development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards.
https://www.xilinx.com/ai
Apache License 2.0
1.45k stars 626 forks

Got DPU timeout error after one successful inference. #831

Closed imzongjian closed 2 years ago

imzongjian commented 2 years ago

The info here:

root@zcu3_plnx:/usr/share/vitis_ai_library/samples/yolov4# ./test_jpeg_yolov4 yolov4s_coco_416_b1600 road416.jpg
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0605 05:15:05.852934 7798 xrt_cu.cpp:188] cu timeout! device_core_idx 0 handle=0xaaab1b63e630 ENV_PARAM(XLNX_DPU_TIMEOUT) 10000 state 1 ERT_CMD_STATE_COMPLETED 4 ms 10010 bo=1 is_done 0
I0605 05:15:05.853041 7798 xrt_cu.cpp:99] Total: 10010332us ToDriver: 18446742734978439us ToCU: 0us Complete: 0us Done: 1348741444us
F0605 05:15:05.853070 7798 dpu_control_xrt_edge.cpp:186] dpu timeout! core_idx = 0
 LSTART 0 LEND 0 CSTART 0 CEND 0 SSTART 0 SEND 0 MSTART 0 MEND 0 CYCLE_L 1614041207 CYCLE_H 52
*** Check failure stack trace: ***
Aborted

I cannot figure this out: inference succeeds only once after each power-on of the board.

Version: Vitis AI 2.0, Vitis/PetaLinux 2021.2. Board: custom board. Chip: zcu3eg

qianglin-xlnx commented 2 years ago

Hi @imzongjian, could you please check the DPU interrupts before and after you run the test_jpeg_yolov4 program?

cat /proc/interrupts | grep zocl

If the DPU interrupt count does not increase, there is a problem with the integration of the DPU. In that case, you should probably check the DPU configuration in the device tree.
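The before/after comparison suggested above can be scripted. The following is only a minimal sketch: the function names `zocl_irq_total` and `check_dpu_irq` are hypothetical helpers, not part of Vitis AI or XRT, and the sketch assumes the usual `/proc/interrupts` layout in which the per-CPU counters directly follow the IRQ number.

```shell
#!/bin/sh
# Sum the per-CPU interrupt counts of every zocl line in /proc/interrupts-style input.
# Reads stdin so it can be tried on saved output as well as on the live file.
zocl_irq_total() {
    awk '/zocl/ {
        # Fields after "NN:" are per-CPU counters until the first non-numeric field.
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
    } END { print total + 0 }'
}

# Hypothetical wrapper: run a command and report how many zocl interrupts it generated.
check_dpu_irq() {
    before=$(zocl_irq_total < /proc/interrupts)
    "$@"
    after=$(zocl_irq_total < /proc/interrupts)
    echo "zocl interrupts: before=$before after=$after delta=$((after - before))"
    # Succeeds only if at least one new zocl interrupt was counted.
    [ "$after" -gt "$before" ]
}
```

On the target, this could be invoked as `check_dpu_irq ./test_jpeg_yolov4 yolov4s_coco_416_b1600 road416.jpg`. A delta of 0 would point to the integration problem described above.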

imzongjian commented 2 years ago

@qianglin-xlnx Thanks for your solution. My device tree contained an &amba node, so if I delete it, things should work, right? I also have a question about sysroots. According to [https://github.com/Xilinx/Vitis-AI/blob/master/setup/mpsoc/VART/README.md#step2-setup-the-target], I can get a sysroot by running ./host_cross_compiler_setup.sh, but according to [https://github.com/Xilinx/Vitis-Tutorials/blob/2021.2/Vitis_Platform_Creation/Introduction/02-Edge-AI-ZCU104/step2.md], I can also get a sysroot by running sdk.sh. Which one should be linked to the Vitis IDE?

tianfang-fafafa commented 2 years ago

Hi @imzongjian, on your second question: I think the two SDKs are of the same type, but they probably contain different components because they come from different steps and different READMEs. Their configuration may differ, too. Please take a look at this: https://github.com/Xilinx/Vitis-Tutorials/blob/2021.2/Vitis_Platform_Creation/Introduction/02-Edge-AI-ZCU104/step4.md

tianfang-fafafa commented 2 years ago

@imzongjian, on the first question: first, you can check the interrupts of zocl_cu (this is the DPU), as qianglin mentioned.

qianglin-xlnx commented 2 years ago

Hi @imzongjian Is this issue solved?

JH989876525 commented 2 years ago

Hi guys, I ran into this issue with Xilinx IDE 2022.1 and Vitis AI 2.5. I found that the system did not receive the interrupt from the DPU correctly after the first inference. The commands I ran and their results are shown below:

$ ./test_jpeg_yolov4 yolov4-tiny_usb_2_to_7_pytorch usb-blk-1.png
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0826 01:22:49.460566  1305 demo.hpp:1183] batch: 0     image: usb-blk-1.png
I0826 01:22:49.460831  1305 process_result.hpp:44] RESULT: 0    343.401 194.306 525.801 452.706 0.997589
$ cat /proc/interrupts | grep zocl
 74:          0          0          0          0     GICv2 122 Level     zocl_cu[1]
$ ./test_jpeg_yolov4 yolov4-tiny_usb_2_to_7_pytorch usb-blk-1.png
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0826 01:52:43.823026  1259 xrt_cu.cpp:188] cu timeout! device_core_idx 0  handle=0xaaab074eb180 ENV_PARAM(XLNX_DPU_TIMEOUT) 10000 state 1 ERT_CMD_STATE_COMPLETED 4 ms 10010  bo=1 is_done 0 
I0826 01:52:43.823288  1259 xrt_cu.cpp:99] Total: 10010201us    ToDriver: 18446743704292691us   ToCU: 0us       Complete: 0us   Done: 379427061us
F0826 01:52:43.823331  1259 dpu_control_xrt_edge.cpp:186] dpu timeout! core_idx = 0
 LSTART 0  LEND 0  CSTART 0  CEND 0  SSTART 0  SEND 0  MSTART 0  MEND 0  CYCLE_L 2002069038  CYCLE_H 0 
*** Check failure stack trace: ***
Aug 26 01:52:42 xilinx-kv260-starterkit-20221 kernel: zocl-drm axi:zyxclmm_drm:  ffff000809ea8410 kds_del_cu_context: 1 outstanding command(s) on CU(0)
$ cat /proc/interrupts | grep zocl
 74:          0          0          0          0     GICv2 122 Level     zocl_cu[1]

The DPU interrupt in the Vivado block design is connected to pl_ps_irq0[0]. Thus, according to the user guide, its interrupt index in the device tree should be 89 and its GIC IRQ index should be 121. But the GIC IRQ index of the DPU in the output above is 122. I think this is what causes the issue.

$ cat /proc/interrupts | grep zocl
 74:          0          0          0          0     GICv2 122 Level     zocl_cu[1]
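The index arithmetic above can be written out as a small check. This is a sketch under stated assumptions: only the pl_ps_irq0[0] to GIC 121 to device-tree 89 mapping comes from this thread. The consecutive mapping for the other bits and the helper names (`pl_ps_irq0_to_gic`, `gic_to_dt_spi`, `check_irq_wiring`) are assumptions made for illustration. The offset of 32 reflects that GIC shared peripheral interrupts start at ID 32.

```shell
#!/bin/sh
# Assumption: pl_ps_irq0[n] maps to consecutive GIC IDs starting at 121 for n = 0..7.
pl_ps_irq0_to_gic() {
    echo $((121 + $1))
}

# GIC shared peripheral interrupts start at ID 32, so the device-tree SPI index is ID - 32.
gic_to_dt_spi() {
    echo $(($1 - 32))
}

# Hypothetical helper: compare the wired pl_ps_irq0 bit with the GIC ID shown in /proc/interrupts.
check_irq_wiring() {
    bit=$1
    observed=$2
    expected=$(pl_ps_irq0_to_gic "$bit")
    if [ "$observed" -eq "$expected" ]; then
        echo "match: pl_ps_irq0[$bit] -> GIC $expected (device-tree SPI $(gic_to_dt_spi "$expected"))"
    else
        echo "mismatch: pl_ps_irq0[$bit] -> GIC $expected, but /proc/interrupts reports GIC $observed"
        return 1
    fi
}
```

With the values reported here, `check_irq_wiring 0 122` reports a mismatch, which is consistent with the interrupt counter never increasing.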

After I manually changed the connection of the DPU interrupt in the Vivado block design and regenerated the bitstream, the DPU can run inference multiple times with correct results.

I hope this post helps. But how do I fix this issue once and for all?

jiaz-xlnx commented 2 years ago

Hi @JH989876525, I don't know how you modified the platform, but there is a simple method. When you delete the interrupt signal and reconnect it to lin1[0:0], two Tcl commands appear in the console window. Copy them to the end of the prj/Vitis/syslink strip_interconnects.tcl file, then compile the whole project. You can also check the project to verify the connections.

qianglin-xlnx commented 2 years ago

Hi @JH989876525 and @imzongjian, there has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.