Xilinx / video-sdk

https://xilinx.github.io/video-sdk

Failed to load /opt/xilinx/xcdr/xclbins/transcode.xclbin to device #84

Closed tamasgp closed 9 months ago

tamasgp commented 9 months ago

I repeatedly face an issue with my Alveo U30 cards on every server reboot. When sourcing the /opt/xilinx/xrt/setup.sh script, I get an error message like this:

{
  "response": {
    "name": "load",
    "requestId": "1",
    "status": "failed",
    "data": {
      "failed": "xclLoadXclBin failed, rc = -5, failed to load /opt/xilinx/xcdr/xclbins/transcode.xclbin to device X"
    }
  }
}
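Since the failing device changes between reboots, it can help to pull the return code and device out of each failed response and keep a record per boot. The following is only a minimal sketch: the function name `parse_load_failure` and the regular expression are hypothetical and derived solely from the single message quoted above, and it assumes that `rc` is a negated Linux errno value (in which case -5 would correspond to EIO).

```python
import errno
import json
import re

# Response text copied from the report; "X" stands in for the device number.
SAMPLE_RESPONSE = (
    '{ "response": { "name": "load", "requestId": "1", "status": "failed", '
    '"data": { "failed": "xclLoadXclBin failed, rc = -5, failed to load '
    '/opt/xilinx/xcdr/xclbins/transcode.xclbin to device X" } } }'
)

# Hypothetical pattern, based only on the one message shown in the report.
FAILURE_RE = re.compile(r"rc = (-?\d+), failed to load (\S+) to device (\S+)")


def parse_load_failure(text):
    """Return details of a failed load response, or None if it did not fail."""
    response = json.loads(text)["response"]
    if response.get("status") != "failed":
        return None
    message = response.get("data", {}).get("failed", "")
    match = FAILURE_RE.search(message)
    if match is None:
        return None
    rc = int(match.group(1))
    return {
        "rc": rc,
        # Assumption: rc is a negated errno value, so look up its symbolic name.
        "errno_name": errno.errorcode.get(-rc, "unknown"),
        "xclbin": match.group(2),
        "device": match.group(3),
    }
```

Calling `parse_load_failure(SAMPLE_RESPONSE)` yields the return code, the xclbin path, and the device field as a dictionary, which could be logged after each reboot to see whether particular cards fail more often than others.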

The device number usually differs between reboots. (I have 6 cards in the server, and the issue is not limited to a single Alveo card.) After some cold or hot resets, all the devices initialize successfully, and everything works fine until the next server reboot.

While the issue occurs, I can see the following messages (truncated; only the error part is copied):

[ 81.185241] xclmgmt 0000:b2:00.0: xfer_versal.m.27262987 ffff8b5b96495810 xfer_versal_transfer: start writting data_len: 13107697, timeout: 24s
[ 81.759175] xclmgmt 0000:b2:00.0: xfer_versal.m.27262987 ffff8b5b96495810 wait_for_status: Timeout, packet header is fffc0201
[ 81.759219] xclmgmt 0000:b2:00.0: xfer_versal.m.27262987 ffff8b5b96495810 xfer_versal_transfer: Data transfer error
[ 81.761099] xclmgmt 0000:b2:00.0: mailbox.m.9437195 ffff8b3bad5c0810 mailbox_post_response: posting response for: 7 via HW
[ 81.761200] xocl 0000:b2:00.1: icap.u.23068683 ffff8b3babb55c10 __icap_peer_xclbin_download: peer xclbin download err: -5
[ 81.762267] xocl 0000:b2:00.1: icap.u.23068683 ffff8b3babb55c10 icap_download_bitstream_axlf: err: -5
[ 81.762280] xocl 0000:b2:00.1: ffff8b3bb56a70b0 xocl_init_mem: Topology count = 1, data_length = 40
[ 81.762300] xocl 0000:b2:00.1: ffff8b3bb56a70b0 xocl_read_axlf_helper: Failed to download xclbin, err: -5
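With six cards in the chassis, grouping the failing log lines by PCI address makes it easier to see which card timed out on a given boot. The sketch below is illustrative only: `failures_by_card`, the line pattern, and the keyword list are hypothetical and inferred from the messages quoted above, not taken from the driver sources. It treats the management function (`.0`) and the user function (`.1`) of the same address as one card.

```python
import re
from collections import defaultdict

# Hypothetical pattern for lines of the form shown above:
# "[ <timestamp>] <driver> <domain:bus:device.function>: <message>"
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>xclmgmt|xocl)\s+"
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+(?P<msg>.*)"
)

# Case-sensitive keywords seen in the failing lines of the report. The lowercase
# "timeout: 24s" in the transfer-start line is deliberately not matched.
ERROR_RE = re.compile(r"\berr(or)?\b|\bTimeout\b|\bFailed\b")


def failures_by_card(lines):
    """Group error-looking xclmgmt/xocl log lines by PCI domain:bus:device."""
    cards = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None or ERROR_RE.search(match.group("msg")) is None:
            continue
        # Drop the function number so mgmt (.0) and user (.1) share one key.
        card = match.group("bdf").rsplit(".", 1)[0]
        cards[card].append(
            (float(match.group("ts")), match.group("driver"), match.group("msg"))
        )
    return dict(cards)
```

Feeding it the saved kernel log line by line returns a dictionary keyed by card address, where each value is the list of timestamped error messages for that card.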

Can someone tell me how to debug or fix this issue?

tamasgp commented 9 months ago

Forgot to mention that all cards have the latest firmware.

NastoohX commented 9 months ago

Hi, Thank you for bringing this matter to our attention. If you could provide short answers to the following questions, it would help us debug this issue:
1- Does this problem happen on every reboot? If so, does it mean that you are not able to use the cards, or is there a workaround?
2- Just confirming that this is an on-prem setup and not a cloud deployment?
3- Have you observed the same behaviour with a single card in the chassis?
4- Can you provide the output of the following commands: lspci -d 10ee: and xbutil examine?
5- Would you also be able to provide the output of Step 11 of the SDK installation, https://xilinx.github.io/video-sdk/v3.0/getting_started_on_prem.html#:~:text=of%20the%20machine.-,Test,-that%20the%20installation ?

Cheers,
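The outputs requested in question 4 can be gathered in one pass so that they are captured from the same boot as the failure. This is a rough sketch, not an official tool: `collect_outputs` and `DIAGNOSTIC_COMMANDS` are hypothetical names, and it assumes the XRT environment has already been set up (for example by sourcing /opt/xilinx/xrt/setup.sh) so that `xbutil` is on the PATH.

```python
import subprocess

# The two commands named in question 4 above.
DIAGNOSTIC_COMMANDS = {
    "lspci": ["lspci", "-d", "10ee:"],
    "xbutil": ["xbutil", "examine"],
}


def collect_outputs(commands, timeout_s=60):
    """Run each command and return its combined stdout and stderr by name."""
    outputs = {}
    for name, argv in commands.items():
        try:
            done = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout_s
            )
            outputs[name] = done.stdout + done.stderr
        except FileNotFoundError:
            # The tool is not installed or not on the PATH.
            outputs[name] = "command not found: " + argv[0]
        except subprocess.TimeoutExpired:
            outputs[name] = "timed out after %d s" % timeout_s
    return outputs
```

Calling `collect_outputs(DIAGNOSTIC_COMMANDS)` right after a failed load returns both outputs as text that can be pasted into the thread.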

NastoohX commented 9 months ago

Hi, There being no further activity on this thread, I am closing this ticket. Feel free to reopen this or open a new ticket if the need arises. Cheers,