@108anup, thanks. Someone reported a similar error message recently, but I haven't encountered this myself.
What version of the kernel were you using earlier successfully?
What version of the kernel seems like it is no longer working?
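(If helpful, on Ubuntu something like the following shows the running kernel and the installed kernel packages:)
# show the currently running kernel
uname -r
# list installed kernel image packages (Ubuntu/Debian)
dpkg --list | grep linux-image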
The driver is not working on either of the following kernels: 4.15.0-177-generic and 4.15.0-176-generic, on Ubuntu 18.04.
I don't think the kernel has actually changed since then, so I would imagine the previously working kernel was also 4.15.0-176-generic. But I can't confirm, as I don't know which kernel was running the last time it worked properly.
I have tried both U50 and U280 boards on a Supermicro SYS-2029GP-TR/X11DPG-SN server.
Does the open nic driver need to be loaded to use pcimem? If not, then I also can't read any address from the device over PCIe using pcimem. This is for both U50/U280 using the vanilla open nic shell.
Example:
$> sudo $PCIMEM /sys/bus/pci/devices/$EXTENDED_DEVICE_BDF1/resource2 0x10400
/sys/bus/pci/devices/0000:3b:00.0/resource2 opened.
Target offset is 0x10400, page size is 4096
mmap(0, 4096, 0x3, 0x1, 3, 0x10400)
PCI Memory mapped to address 0x7f2a49067000.
0x10400: 0xFFFFFFFF
This address ideally returns the temperature.
The driver doesn't need to be loaded to use pcimem. However, on some kernel versions, before using pcimem or similar, I need to run something like: sudo setpci -s 0a:00.0 COMMAND=0x02
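As a rough example (0a:00.0 is just the BDF from my machine; substitute your device's):
# read back the 16-bit PCI command register (bit 1 is memory space enable)
sudo setpci -s 0a:00.0 COMMAND
# set memory space enable so BAR reads via pcimem can work
sudo setpci -s 0a:00.0 COMMAND=0x02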
I'm curious, though: is it that you can't use pcimem before attempting to load/install onic.ko, or only afterwards? I'm wondering whether somehow loading the driver first is causing your issue with pcimem.
Even before trying to load the driver, I am unable to read registers using pcimem.
Steps: On a fresh cold reboot:
echo 1 | sudo tee "/sys/bus/pci/devices/${bridge_bdf}/${EXTENDED_DEVICE_BDF1}/remove" > /dev/null
echo 1 | sudo tee "/sys/bus/pci/devices/${bridge_bdf}/rescan" > /dev/null
sudo setpci -s $EXTENDED_DEVICE_BDF1 COMMAND=0x02
After this, when I try pcimem, it shows 0xFFFFFFFF irrespective of the address.
Since pcimem should work even without the driver, I am guessing that the inability to read via pcimem might be what is causing insmod to crash. I am not sure, though, why pcimem might not be working with the vanilla bitstream; this has worked for me before.
Okay, I actually also had to warm reboot after step 4 above. I think a warm reboot is needed whenever the FPGA changes from a non open nic bitstream to open nic. After a cold reboot, the image gets reset to the golden platform, e.g., xilinx_u280_xdma_201920_3 for the U280.
After warm reboot both pcimem and driver loading work.
I might have done a cold reboot since the last time I used open nic and forgot to do a warm reboot after loading open nic bitstream.
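For reference, the sequence that works for me now is roughly the following (using the same $PCIMEM and $EXTENDED_DEVICE_BDF1 variables as in the example above):
# 1) program the open-nic-shell bitstream through the Vivado hardware manager
# 2) warm reboot so the host re-enumerates the PCIe device
sudo reboot
# 3) after the reboot, enable memory space access on the FPGA endpoint
sudo setpci -s $EXTENDED_DEVICE_BDF1 COMMAND=0x02
# 4) sanity-check a register read; a value other than 0xFFFFFFFF suggests the BAR is readable
sudo $PCIMEM /sys/bus/pci/devices/$EXTENDED_DEVICE_BDF1/resource2 0x10400
# 5) load the driver
sudo insmod onic.ko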
Thanks so much for your previous suggestions. I followed the four steps you mentioned and did a warm reboot. The U250 still does not show the interface through ifconfig. Furthermore, when I warm reboot, the onic module disappears and has to be insmod-ed again. The onic kernel module also does not end up bound to the device. Could you please give me any suggestions? Thank you so much in advance.
65:00.0 Network controller: Xilinx Corporation Device 903f
Subsystem: Xilinx Corporation Device 0007
Flags: bus master, fast devsel, latency 0, IRQ 124, NUMA node 0
Memory at e0c40000 (64-bit, non-prefetchable) [size=256K]
Memory at e0800000 (64-bit, non-prefetchable) [size=4M]
Capabilities: <access denied>
Kernel driver in use: qdma-pf
Kernel modules: xdma, qdma_pf
65:00.1 Network controller: Xilinx Corporation Device 913f
Subsystem: Xilinx Corporation Device 0007
Flags: bus master, fast devsel, latency 0, IRQ 124, NUMA node 0
Memory at e0c00000 (64-bit, non-prefetchable) [size=256K]
Memory at e0400000 (64-bit, non-prefetchable) [size=4M]
Capabilities: <access denied>
Kernel driver in use: qdma-pf
Kernel modules: qdma_pf
Have you loaded the onic bitstream using the Xilinx Vivado toolchain? Once you do that and do a warm reboot, in lspci -vvv the FPGA should show up as a "Memory Controller" instead of a "Network Controller". Then, when you insmod the onic.ko kernel module, lspci -vvv should also show the kernel driver in use as onic.
Reference lspci -vvv output:
86:00.0 Memory controller: Xilinx Corporation Device 903f
Subsystem: Xilinx Corporation Device 0007
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 199
NUMA node: 1
Region 0: Memory at e0c00000 (64-bit, non-prefetchable) [size=256K]
Region 2: Memory at e0400000 (64-bit, non-prefetchable) [size=4M]
Capabilities: <access denied>
Kernel driver in use: onic
86:00.1 Memory controller: Xilinx Corporation Device 913f
Subsystem: Xilinx Corporation Device 0007
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 199
NUMA node: 1
Region 0: Memory at e0c40000 (64-bit, non-prefetchable) [size=256K]
Region 2: Memory at e0800000 (64-bit, non-prefetchable) [size=4M]
Capabilities: <access denied>
Kernel driver in use: onic
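A quick way to check just the Xilinx devices and which kernel driver is bound (10ee is the Xilinx vendor ID):
# show Xilinx PCIe devices along with the kernel driver in use
lspci -k -d 10ee: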
There was a pull request that we merged earlier this year that changed it from "Memory Controller" into a "Network Controller".
Two suggestions that are probably important are:
First, confirm that the FPGA bit file meets the necessary timing constraints by opening the project for the design in Vivado's GUI and checking the worst negative slack (WNS). Also, please note that the design needs to be built with Vivado 2022.1 because of the QDMA IP (Vivado 2022.2 and Vivado 2023.1 introduce a major version change to the QDMA IP that doesn't yet seem to be compatible with this design; if you build with an older version of Vivado, just don't upgrade the QDMA IP when opening the project in a newer Vivado). If the implementation results don't meet timing, change the Vivado implementation strategy to something like the "Performance Explore" or "Performance1" options to increase the place-and-route effort, set the new implementation settings to "active", and rebuild.
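If it helps, a rough way to spot-check WNS without the GUI, assuming a project-mode build with the default impl_1 run and report names, is to grep the routed timing summary report:
# WNS appears in the Design Timing Summary table of the routed timing report
grep -A2 "WNS" <project>.runs/impl_1/*_timing_summary_routed.rpt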
After the bit file meets timing, the order should be: 1) load the bit file on the FPGA, 2) warm reboot, 3) run insmod ..., and 4) run ifconfig.
If it doesn't appear, here are some suggestions: 5) try checking the link status with pcimem or similar by reading the CMAC status registers:
#writes to enable the CMAC (adjust for your PCI BDF resource path)
sudo ~cneely/pcimem/pcimem /sys/devices/pci0000:00/0000:00:03.2/0000:0b:00.0/resource2 0x8014 w 0x1;
sudo ~cneely/pcimem/pcimem /sys/devices/pci0000:00/0000:00:03.2/0000:0b:00.0/resource2 0x800c w 0x1;
# read the link status, two reads are necessary. The second read should be 0x3 if you have link
sudo ~cneely/pcimem/pcimem /sys/devices/pci0000:00/0000:00:03.2/0000:0b:00.0/resource2 0x8204;
sudo ~cneely/pcimem/pcimem /sys/devices/pci0000:00/0000:00:03.2/0000:0b:00.0/resource2 0x8204;
6) try running ip link show to see the adapter name and number, e.g.
5: enp11s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 00:0a:35:ad:bf:c8 brd ff:ff:ff:ff:ff:ff
6: enp11s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 00:0a:35:73:31:40 brd ff:ff:ff:ff:ff:ff
Maybe assign a static IP using e.g. netplan: create a YAML file for netplan within /etc/netplan/, for example:
#U250
network:
  version: 2
  renderer: networkd
  ethernets:
    # in this example my other ethernet device is enp4s0, which needs dhcp
    enp4s0:
      dhcp4: yes
      dhcp6: yes
      addresses: [192.168.1.109/24]
    # in this case my U250 open-nic-shell interfaces are below
    enp11s0f0:
      dhcp4: no
      dhcp6: no
      addresses: [192.168.20.4/24]  # this is just what I used for my testing
    #...
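Then apply it and verify, e.g.:
# apply the netplan configuration
sudo netplan apply
# confirm the address was assigned to the interface
ip addr show enp11s0f0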
Thank you so much for your reply. I followed the four steps you mentioned.
Our Vivado version is 2021.2 and the WNS is 0.034.
lspci -vd 10ee:
65:00.0 Network controller: Xilinx Corporation Device 903f
Subsystem: Xilinx Corporation Device 0007
Flags: fast devsel, IRQ 124, NUMA node 0
Memory at e0c40000 (64-bit, non-prefetchable) [virtual] [size=256K]
Memory at e0800000 (64-bit, non-prefetchable) [virtual] [size=4M]
Capabilities: [40] Power Management version 3
Capabilities: [60] MSI-X: Enable- Count=10 Masked-
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [1c0] Secondary PCI Express
Capabilities: [200] Virtual Channel
Kernel driver in use: qdma-pf
Kernel modules: xdma, qdma_pf
65:00.1 Network controller: Xilinx Corporation Device 913f
Subsystem: Xilinx Corporation Device 0007
Flags: fast devsel, IRQ 124, NUMA node 0
Memory at e0c00000 (64-bit, non-prefetchable) [virtual] [size=256K]
Memory at e0400000 (64-bit, non-prefetchable) [virtual] [size=4M]
Capabilities: [40] Power Management version 3
Capabilities: [60] MSI-X: Enable- Count=9 Masked-
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Kernel driver in use: qdma-pf
Kernel modules: qdma_pf
> try checking the link status with pcimem or similar by reading the CMAC status registers:
I checked it; the values are correct.
Then I continued to check the onic.ko kernel module via lsmod, which shows it is also loaded on the host.
Module Size Used by
ftdi_sio 61440 1
onic 118784 0
Finally, I still cannot see the network interface via ifconfig.
I think the problem might be with the insmod of onic.ko. In my case, I cannot get onic.ko attached as the kernel driver in use. I have no idea how to continue; if you have any suggestions, please share them with me. If you need any other information, please let me know. I much appreciate your time and suggestions.
Best regards~
I suspect the issue is due to the xdma and qdma_pf kernel modules being loaded. Can you temporarily blacklist or not load those modules when trying the onic module?
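A minimal sketch of doing that on Ubuntu (the blacklist file name here is arbitrary):
# unload the conflicting modules for the current session
sudo rmmod qdma_pf xdma
# keep them from loading automatically at boot
printf "blacklist xdma\nblacklist qdma_pf\n" | sudo tee /etc/modprobe.d/blacklist-opennic.conf
# may be needed if the modules are included in the initramfs
sudo update-initramfs -u
# then try the OpenNIC driver
sudo insmod onic.ko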
> I suspect the issue is due to the xdma and qdma_pf kernel modules being loaded. Can you temporarily blacklist or not load those modules when trying the onic module?
That's right. I blacklisted the xdma and qdma_pf kernel modules, the onic kernel module can now be used successfully, and I can get the interfaces on the U250. Thank you so much. Additionally, is there any solution to make these three kernel modules compatible? Thank you so much again.
Sorry, another question: when I warm reboot the FPGA (or host), the onic module disappears and I have to insmod it again. Is this normal for the onic driver? Thank you so much again.
> Sorry, another question: when I warm reboot the FPGA (or host), the onic module disappears and I have to insmod it again. Is this normal for the onic driver?
@manwu1994 in case you haven't resolved this yet: kernel modules loaded with insmod are not permanent. To load kernel modules at boot, see here: https://www.cyberciti.biz/faq/linux-how-to-load-a-kernel-module-automatically-at-boot-time/
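A rough sketch of the usual approach for an out-of-tree module (assuming the onic.ko you built; exact paths can vary by distro):
# install the module where modprobe can find it
sudo mkdir -p /lib/modules/$(uname -r)/extra
sudo cp onic.ko /lib/modules/$(uname -r)/extra/
sudo depmod -a
# have it loaded automatically at boot
echo onic | sudo tee /etc/modules-load.d/onic.conf
# load it now without rebooting
sudo modprobe onic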
We have managed to narrow this issue down to the memory range of the BARs and/or the prefetch setting in the QDMA IP. I'll update once I have a clearer picture of what is going on.
@cneely-amd do you know why prefetch is not enabled by default in the open NIC designs? Is there any reason we can't enable it?
@wnew I don't know what the tradeoffs would be for enabling/using prefetch vs. without. I've been maintaining the OpenNIC shell and drivers, but I didn't create the original designs, and so some of the reasoning behind certain design choices I don't know.
Are you experimenting with prefetch?
--Chris
insmod onic.ko is hanging. I see the following in the dmesg log. It used to work fine earlier. Could it be an issue due to an interaction with an updated Linux kernel? This is using the vanilla open nic shell bitstream on a U280 FPGA.
dmesg log: