Xilinx / xup_vitis_network_example

VNx: Vitis Network Examples

Server crashes when loading the benchmark bitfile #77

Closed: trashcrash closed this issue 2 years ago

trashcrash commented 2 years ago

I have two Alveo U280 cards in the same host, directly connected to each other. I built the design using `make all DEVICE=xilinx_u280_xdma_201920_3 INTERFACE=3 DESIGN=benchmark`. Here's what I was trying to run:

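```python
from vnx_utils import *
import pynq

# Benchmark xclbin built with INTERFACE=3 (both network interfaces)
xclbin = '../benchmark.intf3.xilinx_u280_xdma_201920_3/vnx_benchmark_if3.xclbin'
# Download the xclbin to each U280; the crash happens on the first call
ol_w0 = pynq.Overlay(xclbin, device=pynq.Device.devices[0])
ol_w1 = pynq.Overlay(xclbin, device=pynq.Device.devices[1])
```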

The system crashes and reboots automatically when `ol_w0 = pynq.Overlay(xclbin,device=pynq.Device.devices[0])` is executed. I also confirmed that both devices are enumerated:

```python
for i, dev in enumerate(pynq.Device.devices):
    print("{}) {}".format(i, dev.name))
```

Output:

```
0) xilinx_u280_xdma_201920_3
1) xilinx_u280_xdma_201920_3
```

I tried twice with the same bitfile, and also recompiled the design to get a new bitfile. Every attempt led to a system reboot.

I found another thread that may concern the same problem (https://support.xilinx.com/s/question/0D52E00006hpJNLSA2/system-crashes-when-pcie-bit-is-burned-on-fpga?language=en_US). Could it be that the boards are somehow disconnected from PCIe? I ran the basic designs without any problem. Any idea why the crash happens? Thanks!


System information
Manufacturer: Supermicro
Product Name: X9DRG-QF
Version: 0123456789

OS version
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 7.9.2009 (Core)
Release: 7.9.2009
Codename: Core

XRT version
XRT Build Version: 2.6.655
Build Version Branch: 2020.1

PYNQ version
PYNQ version 2.7.0

mariodruiz commented 2 years ago

Hi,

> PCIe are disconnected somehow?

No, this should not happen. The link you referenced is for Vivado designs, not Vitis.

Can you try to grab the dmesg logs of the previous boot?

Can you try to program the Alveo cards in a different order?

Can you try to program the Alveo cards with xbutil?
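A minimal sketch covering the last two suggestions together: program the cards one at a time, in a chosen order, using the legacy xbutil syntax that appears later in this thread (`xbutil program -d <index> -p <xclbin>`). The helper names and the pause between cards are hypothetical; `dry_run` defaults to True, so nothing is flashed unless explicitly requested.

```python
import subprocess
import time
from typing import List, Sequence


def xbutil_program_cmd(device_index: int, xclbin: str) -> List[str]:
    """Build the legacy (XRT 2020.x) xbutil command that programs one card."""
    return ["xbutil", "program", "-d", str(device_index), "-p", xclbin]


def program_in_order(order: Sequence[int], xclbin: str,
                     pause_s: float = 5.0, dry_run: bool = True) -> List[List[str]]:
    """Program the cards one at a time, in the given device order.

    With dry_run=True (the default) the commands are only printed and
    returned; nothing is flashed.
    """
    cmds = [xbutil_program_cmd(idx, xclbin) for idx in order]
    for cmd in cmds:
        # flush=True so the message reaches the terminal before a possible crash
        print("would run:" if dry_run else "running:", " ".join(cmd), flush=True)
        if not dry_run:
            subprocess.run(cmd, check=True)
            time.sleep(pause_s)  # hypothetical settle time between cards
    return cmds
```

For example, `program_in_order([1, 0], "vnx_benchmark_if3.xclbin")` lists the commands for programming card 1 before card 0; passing `dry_run=False` actually runs them.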

The XRT version also seems quite old.

This problem does not seem to be related to either VNx or PYNQ, but I'll try to help.

trashcrash commented 2 years ago

The dmesg log is too long, so I uploaded it to Google Drive: https://drive.google.com/file/d/1zzh4L7YiTF7bIBLyzLN-Q2S45vplo2A_/view?usp=sharing

Programming the cards in a different order gives the same crash.

Using `xbutil program -d 1 -p vnx_benchmark_if3.xclbin` still crashes the system.

You are right, it doesn't seem to be related to your work. I'll close the issue after your response. Thanks again for all the good work :)

mariodruiz commented 2 years ago

The dmesg log you shared is from the current boot. I am interested in the messages from the boot in which the system crashed; you can try something like this: https://unix.stackexchange.com/a/345978
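One way to pull the kernel messages of the crashed boot is `journalctl -k -b -1`, which prints the kernel log of the previous boot. This is a minimal sketch, assuming journald is configured to keep logs across reboots (on CentOS 7 that usually means the `/var/log/journal` directory exists); otherwise the previous boot's messages are simply not stored. The keyword list, which targets the XRT drivers (`xocl`, `xclmgmt`) and PCIe error reporting, is a guess rather than an exhaustive filter.

```python
import subprocess
from typing import Iterable, List

# Substrings worth scanning for: the XRT kernel drivers (xocl, xclmgmt) and
# PCIe error reporting. This keyword list is a guess, not an exhaustive filter.
KEYWORDS = ("xocl", "xclmgmt", "AER", "pcieport")


def previous_boot_kernel_log() -> str:
    """Return kernel messages from the previous boot using journalctl.

    Works only if journald keeps logs across reboots (persistent storage).
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout


def interesting_lines(log: str, keywords: Iterable[str] = KEYWORDS) -> List[str]:
    """Keep only the log lines that mention one of the keywords."""
    kws = tuple(keywords)
    return [line for line in log.splitlines() if any(k in line for k in kws)]
```

After the next crash and reboot, `for line in interesting_lines(previous_boot_kernel_log()): print(line)` shows the driver and PCIe messages leading up to it; the full, unfiltered log is still worth attaching.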

A few things you could try:

Could this be a power problem?

trashcrash commented 2 years ago

With only a single interface, the server doesn't crash. It seems the server simply can't handle two interfaces, whether for power reasons or otherwise.
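The single-interface build that avoided the crash differs from the original `make` command only in its INTERFACE value. A small sketch that assembles that command line; the helper name is hypothetical, and the assumption that INTERFACE=0 or INTERFACE=1 builds one interface while INTERFACE=3 builds both comes from the repository's build options rather than from this thread.

```python
from typing import List

# Platform name used in this thread.
PLATFORM = "xilinx_u280_xdma_201920_3"


def vnx_make_cmd(interface: int, design: str = "benchmark",
                 device: str = PLATFORM) -> List[str]:
    """Assemble the VNx `make all` command line (hypothetical helper).

    Assumption: INTERFACE=0 or INTERFACE=1 builds one network interface and
    INTERFACE=3 builds both.
    """
    if interface not in (0, 1, 3):
        raise ValueError("INTERFACE is expected to be 0, 1 or 3")
    return ["make", "all", f"DEVICE={device}",
            f"INTERFACE={interface}", f"DESIGN={design}"]
```

`" ".join(vnx_make_cmd(0))` gives the single-interface variant of the command from the original report.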