Xilinx / open-nic-shell

AMD OpenNIC Shell includes the HDL source files
Apache License 2.0

Alveo U200 not transmitting/receiving any packets #15

Closed alacamester closed 1 year ago

alacamester commented 2 years ago

Hi!

I built the unmodified code from the main repo with the following switches:

vivado -mode tcl -source build.tcl -tclargs -board au200 -num_phys_func 2 -num_cmac_port 2

Then implemented the project using Vivado 2021.2, and uploaded the firmware to the card. (There were no timing issues whatsoever) The card was recognized after reboot (Ubuntu 20.04), and the onic-driver loaded without errors. (with RS_FEC_ENABLED=0)

Then I assigned IP addresses to the interfaces and put the card into loopback mode by writing '1' to 0x8090 and 0xC090. I checked the link status:

0x8200 = 0x00000000 (no Tx error)
0x8204 = 0x00000003 (Rx aligned OK)

(0x820C and 0x8210 also showed that all lanes are synced/locked: 0x000FFFFF.)

(Also note that the link does not come up when I plug in two QSFP28 100GBASE-SR4 modules and connect them together.)
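The addresses quoted in this post appear to follow a per-port pattern: 0x8090 and 0xC090 differ only in a base of 0x8000 versus 0xC000. Below is a minimal sketch of that arithmetic. The two bases are inferred from this post rather than taken from a specification, and the function name cmac_addr is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: compute a CMAC register address for a given port.
# Assumes port 0 is based at 0x8000 and port 1 at 0xC000, as inferred
# from the 0x8090 / 0xC090 pair in this post (not taken from a spec).
cmac_addr() {
    port=$1      # 0 or 1
    offset=$2    # register offset within the CMAC block, e.g. 0x204
    base=$((0x8000 + $port * 0x4000))
    printf '0x%04X\n' $(($base + $offset))
}

cmac_addr 0 0x090   # loopback control, port 0 -> 0x8090
cmac_addr 1 0x204   # Rx status, port 1       -> 0xC204
```

This only helps keep the two ports' registers straight when repeating the same check on both; how the register is actually read or written is left to whatever tool is in use.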

Then I transmitted some packets and checked with "ifconfig": they show up under TX packets, but RX packets reads '0'.

I can see that the packets reach the Tx packet adapter:

0xB000 = the number of packets I transmitted [CORRECTION: It always shows 0x21, no matter how many packets I transmit! 0xB008 = 0x11CE always, Tx dropped shows '0']
0xB020 = 0 (Rx packet adapter)

Also the CMAC "STAT_TX_TOTAL_PACKETS" register reads 0 (0x8500).

According to the manual and what I see in the code, the box_322 and packet adapter AXI interfaces are connected by default, so what could cause this issue? The 100G interfaces work with the CMAC_USPLUS example code, both over a physical link and in GT loopback mode.

Thanks, L.

cneely-amd commented 2 years ago

Hi @alacamester ,

In Linux, unless a network route is added manually, pinging between two network interfaces on the same machine bypasses the interfaces, so the packets are never actually transmitted. (One exception is DPDK, where you can loop back the interfaces without adding a route.)

So I'd recommend trying two machines first, if that is an option in your test setup.

Someone else had suggested a test approach using network namespaces to send between two interfaces placed in different namespaces, as in the example below. (I previously pasted this example when replying to issue #9 on the open-nic-driver repo.)

eth0=enp10s0f0
eth1=enp10s0f1
ip netns add foobar
ip link set dev $eth0 down
ip link set dev $eth1 down
ip link set dev $eth1 netns foobar
ip address add 192.168.20.2/24 dev $eth0
ip netns exec foobar ip address add 192.168.20.3/24 dev $eth1
#info
ip address show
ip netns exec foobar ip address show
ip link set dev $eth0 up
ip netns exec foobar ip link set dev $eth1 up
ip netns exec foobar ip link set dev lo up
#info
ip route show
ip netns exec foobar ip route show
"then any command from the namespace must be preceded by ip netns exec foobar in this case. For example, ip netns exec foobar ping 192.168.20.2 or ip netns exec foobar nc ..."

I tried something similar to the above example earlier and I was able to ping across the ports on the same U250, using the "ip netns exec foobar" preceding my commands and verifying the RX and TX packet counts using ifconfig (and also ifconfig with the prefix command).
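The counter check described above can be scripted. Below is a minimal sketch that pulls the RX and TX packet counts out of ifconfig output. It assumes the net-tools format that prints "RX packets N" and "TX packets N" on separate lines; older ifconfig versions format these differently. The function name rx_tx_counts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "rx=<n> tx=<n>" from net-tools style
# ifconfig output read on stdin ("RX packets N ..." / "TX packets N ...").
rx_tx_counts() {
    awk '/RX packets/ { rx = $3 }
         /TX packets/ { tx = $3 }
         END { printf "rx=%s tx=%s\n", rx, tx }'
}

# Example usage (requires the namespace setup from the example above):
#   ifconfig enp10s0f0 | rx_tx_counts
#   ip netns exec foobar ifconfig enp10s0f1 | rx_tx_counts
```

Running it before and after a ping on both sides shows whether the TX count on one interface is matched by the RX count on the other.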

With regards to the link status, for some reason two reads are necessary to see the current link status. For example, you might read 0xE0 and then the second time read and get 0x3 (correct status).
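To make values such as 0xE0 and 0x3 easier to interpret, here is a minimal sketch that names the bits set in an Rx status value. The bit positions are an assumption based on the STAT_RX_STATUS_REG layout in the CMAC product guide (PG203) and should be checked against the IP version in use. The function name decode_rx_status is hypothetical, and the sketch takes a value that has already been read rather than reading the register itself.

```shell
#!/bin/sh
# Hypothetical helper: name the bits set in a CMAC Rx status value.
# Bit positions (bit 0 upward) are an assumption based on the
# STAT_RX_STATUS_REG layout in PG203; verify against your IP version.
decode_rx_status() {
    val=$(($1))
    out=""
    bit=0
    for name in status aligned misaligned aligned_err hi_ber \
                remote_fault local_fault internal_local_fault; do
        # Append the name when the corresponding bit is set.
        if [ $(( (val >> bit) & 1 )) -eq 1 ]; then
            out="$out $name"
        fi
        bit=$((bit + 1))
    done
    echo "${out# }"
}

decode_rx_status 0xE0   # first read in the example above
decode_rx_status 0x03   # second read in the example above
```

Under that assumed layout, the first read reports only fault flags and the second reports status and aligned, which fits the advice to read the register twice before trusting it.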

Typically some number of discovery and broadcast packets are transmitted when an interface detects a link status (so I think it's not an issue that 0xB000 was non-zero, however, I recommend checking that register before enabling loopback/before the link status is up). I suspect that these registers didn't change for the reason that I explained above, related to this being a single machine test scenario.

Please let me know whether this advice helps.

Best regards, --Chris

alacamester commented 2 years ago

Thanks for your answer!

I don't think that is the case here, as ARP packets should still be sent (and received in loopback) if I ping an address on the same network (for example, onic IP = 192.168.1.1 and I ping 192.168.1.2). Still, I tested your method as described in open-nic-driver #9, and the ping did not work (both with a physical loopback and with the interfaces put in loopback mode).

ip netns exec foobar ifconfig -a

onic37s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.20.2  netmask 255.255.255.0  broadcast 0.0.0.0
        inet6 fe80::20a:35ff:fe02:1b4f  prefixlen 64  scopeid 0x20
        ether 00:0a:35:02:1b:4f  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 50  bytes 6219 (6.2 KB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

I also read 0x8204 multiple times (it's a latched register; reading clears its value), and it always shows 0x000000C0 if the two ports are physically connected or the loopback is disabled (but it still syncs when 0x8090 is set to '1').

We also have a Silicom (Intel RRC-based) 100G card. The U200 could not establish a link with it using open-nic (but the CMAC example code could).

alacamester commented 2 years ago

I examined the CMAC example code I modified, and there is a U200 specific difference I found.

There are signals that enable the QSFP modules.

Constraints:

set_property -dict {PACKAGE_PIN BE17 IOSTANDARD LVCMOS12} [get_ports QSFP0_RESETL]
set_property -dict {PACKAGE_PIN BE20 IOSTANDARD LVCMOS12} [get_ports QSFP0_MODPRSL]
set_property -dict {PACKAGE_PIN BE21 IOSTANDARD LVCMOS12} [get_ports QSFP0_INTL]
set_property -dict {PACKAGE_PIN BD18 IOSTANDARD LVCMOS12} [get_ports QSFP0_LPMODE]
set_property -dict {PACKAGE_PIN BE16 IOSTANDARD LVCMOS12} [get_ports QSFP0_MODSELL]

set_property -dict {PACKAGE_PIN BC18 IOSTANDARD LVCMOS12} [get_ports QSFP1_RESETL]
set_property -dict {PACKAGE_PIN BC19 IOSTANDARD LVCMOS12} [get_ports QSFP1_MODPRSL]
set_property -dict {PACKAGE_PIN AV21 IOSTANDARD LVCMOS12} [get_ports QSFP1_INTL]
set_property -dict {PACKAGE_PIN AV22 IOSTANDARD LVCMOS12} [get_ports QSFP1_LPMODE]
set_property -dict {PACKAGE_PIN AY20 IOSTANDARD LVCMOS12} [get_ports QSFP1_MODSELL]

Signal strappings added to the code:

assign QSFP0_RESETL = 1'b1;
assign QSFP0_LPMODE = 1'b0;
assign QSFP0_MODSELL = 1'b1;

assign QSFP1_RESETL = 1'b1;
assign QSFP1_LPMODE = 1'b0;
assign QSFP1_MODSELL = 1'b1;

After that the link came up, when physically connecting the U200 interfaces via QSFP-s, and I could send/receive packets. :) (still don't know why it does not work in loopback-mode, but I don't care about that much)

cneely-amd commented 2 years ago

Glad to hear that it is working now.

alacamester commented 2 years ago

I wonder whether it is just my setup that does not work without the signal strappings mentioned above, or whether it is a U200-specific problem. I used the latest board files provided by Xilinx. The CMC firmware on the card has also been updated for a Vitis project.

cneely-amd commented 2 years ago

Hi @alacamester,

I have access to U250 and U280 boards within my test environments, but I don't have a U200 in my local test setup.

For the strappings that you mentioned above, where did you add them to the code? Can you say more (because these seem very device specific)? Can you also please confirm that if you build the design without any modifications that it doesn't run in your test environment, given that you were modifying your test environment earlier?

@Hyunok-Kim contributed the U200 port.

@Hyunok-Kim did you require any other changes for the U200 version?

Thanks, --Chris

alacamester commented 2 years ago

I ran some more tests and narrowed it down: the QSFP control signals are what make the difference.

Unmodified code from git: no link to any physical device (0x8204 = 0x000000C0)
Added QSFP signal strappings to top level: got link between the U200 interfaces, and with the Silicom 100G card (0x8204 = 0x00000003)

I checked the XDC file for U200, which you can download from Xilinx. (Xilinx Design Constraints (XDC) | alveo-u200-xdc_20210909.zip)

These are the comments:

QSFP0 Control Signals
RESETL - Active Low Reset output from FPGA to QSFP Module
MODPRSL - Active Low Module Present input from QSFP to FPGA
INTL - Active Low Interrupt input from QSFP to FPGA
LPMODE - Active High Control output from FPGA to QSFP Module to put the device in low power mode (Optics Off)
MODSEL - Active Low Enable output from FPGA to QSFP Module to select device for I2C Sideband Communication

The two signals that matter for QSFP operation are RESETL and LPMODE. LPMODE is active-high, so by default it should be low. But RESETL is active-low, so it seems that there are no pull-ups on RESETL and the module is always held in reset? (Does anyone have access to the Alveo schematics who can check?)

I think that some QSFP modules implement RESET and some don't. That's why it is working for others, but not for me. My modules are from FS.com: QSFP-100G-SR4-S Compatible 100GBASE-SR4 QSFP28 850nm 100m DOM MTP/MPO MMF Optical Transceiver Module

If that's the case, we could simply add a 'PULLTYPE' property to the XDC (so the top-level Verilog need not be modified), like this? (I have not tried it yet.)

set_property PULLTYPE PULLUP [get_ports QSFP0_RESETL]
set_property PULLTYPE PULLDOWN [get_ports QSFP0_LPMODE]

set_property PULLTYPE PULLUP [get_ports QSFP1_RESETL]
set_property PULLTYPE PULLDOWN [get_ports QSFP1_LPMODE]

Hyunok-Kim commented 2 years ago

I confirmed that the latest upstream version doesn't work for me either. @alacamester, could you try the old version that I used, which works well? https://github.com/Hyunok-Kim/open-nic-shell

For the build, I used Vivado 2021.2 on Ubuntu 20.04. You have to change the Vivado version in script/build.tcl. I noticed that the board file in the latest upstream version differs from the one I used in the old version. I will check the effect.

Hyunok-Kim commented 2 years ago

After I modified the au200 board file in the latest version, I was able to fix the problem. @alacamester, could you try the following branch? https://github.com/Hyunok-Kim/open-nic-shell/tree/fix-au200

The new au200 board file is extracted from Vitis Platform

$ dpkg-deb -xv xilinx-u200-gen3x16-xdma-1-202110-1-dev_1-3221508_all.deb ./tmp/
$ ls tmp/opt/xilinx/platforms/xilinx_u200_gen3x16_xdma_1_202110_1/hw/board/1.3/
au200_image.jpg  changelog.txt  part0_pins.xml  xitem.json
board.xml        LICENSE        preset.xml

alacamester commented 2 years ago

I confirm that it's working, but the board files for Vivado must be updated too. (A "universal" solution would be to specify the pin values in the project via XDC and assign statements; then the board files don't matter.)

alacamester commented 1 year ago

.