Xilinx / open-nic

AMD OpenNIC Project Overview
Apache License 2.0

Card shown in lspci but not in lshw or /sys/devices/pci0000:xx #5

Closed 108anup closed 3 years ago

108anup commented 3 years ago

TL;DR

I was trying to run the loopback test described in open-nic-driver, but the card (an Alveo U50 with the open nic shell) does not show up in /sys/devices, although it does show up in ifconfig and lspci. I'm unsure what might be going on here.

Steps I followed:

I was trying out the open nic shell and driver (both from the main branches, not v1.0). I am using a Supermicro SYS2029GP-TR server with Ubuntu 18.04, Vivado 2020.2, and Linux kernel 4.15.0-161-generic.

I loaded the open nic shell bitstream onto an Alveo U50. The first time I loaded it (direct programming, i.e., the non-MCS method), while the card still had the factory gold image, the server crashed and rebooted automatically. After this, no Xilinx devices were shown in lspci. JTAG was still accessible, so I reloaded the open nic shell bitstream (non-MCS method again, no crash this time) and did a warm reboot. After this, the card does show up in lspci as a memory controller:

3b:00.0 Memory controller: Xilinx Corporation Device 903f

Next, I loaded the onic.ko kernel module. dmesg showed no errors (apart from kernel module verification). A new interface, enp59s0, shows up in ifconfig -a. I ran sudo ifconfig enp59s0 10.1.212.190 netmask 255.255.255.0 up to bring it up and assign it an IP address.

Then I tried to test loopback, following the instructions on the open nic kernel driver page. It gives the following command to enable loopback:

sudo ./pcimem /sys/devices/pci0000:d7/0000:d7:00.0/0000:d8:00.0/resource2 0x8090 w 0x1

lspci showed the device address as 0000:3b:00.0. sudo ethtool -i enp59s0 reports the same address as bus-info:

driver: onic
version: 0.21
firmware-version:
expansion-rom-version:
bus-info: 0000:3b:00.0
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

However, there is no directory for this PCIe device (e.g., pci0000:3b) in /sys/devices/. ls /sys/devices | grep pci0000 shows:

pci0000:00
pci0000:17
pci0000:3a
pci0000:5d
pci0000:80
pci0000:85
pci0000:ae
pci0000:d7

Thus, the pcimem command won't work for this device's bus. The Alveo board also does not show up in sudo lshw -businfo.

108anup commented 3 years ago

FYI, I did not use the setup device script while loading the shell bitstream.

solidgoldbomb commented 3 years ago

Under /sys/devices, you will find a hierarchy of devices that matches the bus hierarchy. Under /sys/bus/pci/devices, you will find a flattened view of all of the devices.

As an example, here are my devices as seen by lspci. I have a U280 card but it should be very similar for you.

$ lspci -D -d 10ee:
0000:65:00.0 Memory controller: Xilinx Corporation Device 903f
0000:65:00.1 Memory controller: Xilinx Corporation Device 913f

However, if I look in /sys/devices, I don't see them there (as expected), because neither the U280 nor the U50 is a PCIe root complex.

$ ls /sys/devices | grep pci0000
pci0000:00
pci0000:16
pci0000:64
pci0000:b2

They're not PCIe root ports so they show up under their appropriate bus hierarchy like this:

$ lspci -tnnv | grep Xilinx
 +-[0000:64]-+-00.0-[65]--+-00.0  Xilinx Corporation Device [10ee:903f]
 |           |            \-00.1  Xilinx Corporation Device [10ee:913f]

Note that the root port they're attached to is up at 0000:64, which does show up directly under /sys/devices. If we look deeper under that tree, we can find our devices here:

$ find /sys/devices/ -type d -name '0000:65:00*'
/sys/devices/pci0000:64/0000:64:00.0/0000:65:00.0
/sys/devices/pci0000:64/0000:64:00.0/0000:65:00.1
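
The same lookup can probably be done without find, since each entry in the flat view under /sys/bus/pci/devices is a symlink into the /sys/devices hierarchy. A minimal sketch, assuming GNU readlink is available; the function name pci_sysfs_path and the SYSFS_ROOT override are made up here for illustration and are not part of any OpenNIC tooling:

```shell
#!/bin/sh
# Resolve a PCI address (domain:bus:device.function) to its hierarchical
# directory under /sys/devices by following the flat-view symlink.
# SYSFS_ROOT is a hypothetical override (useful for testing); it defaults to /sys.
pci_sysfs_path() {
    bdf="$1"
    link="${SYSFS_ROOT:-/sys}/bus/pci/devices/$bdf"
    if [ ! -e "$link" ]; then
        echo "no such PCI device: $bdf" >&2
        return 1
    fi
    # readlink -f follows the symlink and prints the canonical path.
    readlink -f "$link"
}
```

Calling pci_sysfs_path 0000:65:00.0 on the machine above should print the first path that find reported. The parent directory of that path is the bridge or root port the card hangs off.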

My U280 device shows up in lshw like this:

$ sudo lshw -businfo | grep '0000:65:00'
pci@0000:65:00.0               memory         Memory controller
pci@0000:65:00.1               memory         Memory controller

Note that in my case, I don't have the kernel driver bound to it so I'm not sure if that changes how it looks under lshw for me.

108anup commented 3 years ago

Thanks for this great explanation.

The lspci -vt tree shows that device 0000:3b:00.0 is actually connected via the bridge on bus 3a. Physically, there is a PCIe riser that connects to the motherboard and exposes two PCIe x16 slots. I don't know whether the tree represents this physical characteristic or whether there is a bridge on the motherboard itself.

 +-[0000:3a]-+-00.0-[3b]----00.0  Xilinx Corporation Device 903f

And so, the device is actually in "/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0" instead of "/sys/devices/pci0000:3b/0000:3b:00.0". The same device in the flat hierarchy is also in /sys/bus/pci/devices/0000:3b:00.0.
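
To avoid working out the bus hierarchy by hand, the BAR file can also be derived from the interface name, because /sys/class/net/<iface>/device is a symlink to the underlying PCI device directory. A rough sketch under that assumption; iface_bar_path and SYSFS_ROOT are hypothetical names introduced for this example:

```shell
#!/bin/sh
# Print the sysfs resource file for a given BAR of the PCI device that
# backs a network interface.
# SYSFS_ROOT is a hypothetical override (useful for testing); it defaults to /sys.
iface_bar_path() {
    iface="$1"
    bar="$2"
    dev="${SYSFS_ROOT:-/sys}/class/net/$iface/device"
    if [ ! -e "$dev" ]; then
        echo "no PCI device behind interface: $iface" >&2
        return 1
    fi
    # Resolve the symlink to the hierarchical device directory, then
    # append the per-BAR resource file name (resource0, resource2, ...).
    echo "$(readlink -f "$dev")/resource$bar"
}
```

With this, the loopback command from the driver page could be written as sudo ./pcimem "$(iface_bar_path enp59s0 2)" 0x8090 w 0x1, which should work regardless of which root port the card sits behind.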

I can actually use sudo ./pcimem /sys/bus/pci/devices/0000:3b:00.0/resource2 0x8090 to read the mmapped registers on BAR2:

/sys/bus/pci/devices/0000:3b:00.0/resource2 opened.
Target offset is 0x8090, page size is 4096
mmap(0, 4096, 0x3, 0x1, 3, 0x8090)
PCI Memory mapped to address 0x7f21855bc000.
0x8090: 0x00000000

108anup commented 3 years ago

lshw also shows the device:

sudo lshw -businfo | grep '0000:3b:00'
pci@0000:3b:00.0  enp59s0    memory         Ethernet interface