Xilinx-CNS / onload

OpenOnload high performance user-level network stack

SFC9250 ctpio fallback failures #241

Open osresearch opened 2 months ago

osresearch commented 2 months ago

I'm benchmarking two machines with SFC9250 cards, directly connected with 25G fiber on one port and connected via a 10G switch on the other port. If the ctpio threshold is larger than the packet size, there is no problem on either port. On the 10G port, if the ctpio threshold is set lower than the payload length in eflatency, the ctpio fallback seems to work: I see the expected CRC error on the receive side, followed by the correct packet (I'm running with my #238 patches to validate that the received packet matches, although it occurs with the stock build as well).

On the 25G direct-connect port, however, the fallback doesn't always work. It isn't deterministic: sometimes it makes it through several iterations before failing, other times it fails in the first few loops, as in the run below. Usually, after an EF_EVENT_TYPE_RX_DISCARD with a bad CRC caused by the CTPIO underrun, there is a good EF_EVENT_TYPE_RX, but sometimes it never arrives and the ping or the pong side hangs waiting for the next packet.

cpuset-exec taskset -c 2 ./thudson/eflatency -s 128 -c 64 -r -v -v  pong ens2f0
# ef_vi_version_str: b58f121a 2024-08-29 eflatency-echo ef_vi
# NIC(s) 22 -1
# udp payload len: 128:128:1
# iterations: 100000
# warmups: 10000
# frame len: 170
# mode: CTPIO
# vlan: 0
# validating: yes
# verbose: 2
event [ev:2]
RX discard bad CRC (2)
event [ev:0]
event [ev:1]
event [ev:2]
RX discard bad CRC (2)
event [ev:0]
event [ev:1]
event [ev:0]
event [ev:1]
event [ev:2]
RX discard bad CRC (2)
event [ev:0]
event [ev:1]
event [ev:0]
event [ev:1]
event [ev:2]
RX discard bad CRC (2)
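
To make the hang described above easier to spot, the wait for the retransmitted packet can be bounded instead of open-ended. The following is only a minimal sketch of that idea in Python, not eflatency or ef_vi code: `poll_events` is a hypothetical callable standing in for the real event poll, and the strings `"RX"` and `"RX_DISCARD"` are placeholders for EF_EVENT_TYPE_RX and EF_EVENT_TYPE_RX_DISCARD.

```python
import time


def wait_for_good_rx(poll_events, timeout_s, clock=time.monotonic):
    """Poll until a good RX event arrives or the timeout expires.

    poll_events is a hypothetical callable returning a (possibly empty)
    list of event kinds per call. Returns (got_good_rx, discards_seen).
    """
    deadline = clock() + timeout_s
    discards = 0
    while clock() < deadline:
        for kind in poll_events():
            if kind == "RX":
                # Good packet arrived, possibly after one or more discards.
                return True, discards
            if kind == "RX_DISCARD":
                # Expected after a CTPIO underrun; keep waiting for the resend.
                discards += 1
    # Timed out: report how many discards were seen without a good packet.
    return False, discards
```

A result of `(False, n)` with `n > 0` would correspond to the failure case here: a discard was seen, but the good packet never followed within the timeout.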

Interestingly, it also seems to depend on the value of the ctpio threshold and the size of the packet. Very low values for the threshold work (8-16), as do ones close to the packet size (90-128). It also does not seem to affect very short (<64 bytes) or larger packets like 1K.
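
For reference when comparing thresholds against sizes: the log above reports a frame length of 170 for a 128-byte UDP payload, which matches the usual 14-byte Ethernet, 20-byte IPv4 (no options) and 8-byte UDP headers. The sketch below just reproduces that arithmetic. The helper names are made up for illustration, and the assumption that the threshold is compared against the whole frame (rather than the payload) is mine, not something confirmed in this thread.

```python
ETH_HDR = 14   # Ethernet header, no FCS
VLAN_TAG = 4   # optional 802.1Q tag
IPV4_HDR = 20  # IPv4 header without options
UDP_HDR = 8


def frame_len(udp_payload_len, vlan=False):
    """Frame length for a UDP payload, excluding the trailing FCS."""
    return ETH_HDR + (VLAN_TAG if vlan else 0) + IPV4_HDR + UDP_HDR + udp_payload_len


def may_cut_through(ct_threshold, udp_payload_len, vlan=False):
    """Assumption: transmission can start early only if the threshold
    is smaller than the full frame."""
    return ct_threshold < frame_len(udp_payload_len, vlan)


print(frame_len(128))  # 170, matching "frame len: 170" in the log
```

With `-s 128 -c 64` as in the failing run, `may_cut_through(64, 128)` is true under this assumption, so the underrun and fallback path can be exercised.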

This is with onload 8.1.3 and ethtool -i ens2f0 reports:

driver: sfc
version: 5.3.18.1012
firmware-version: 8.0.0.1015 rx1 tx1
expansion-rom-version:
bus-info: 0000:5e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: yes

lspci | grep 5e

5e:00.0 Ethernet controller: Solarflare Communications XtremeScale SFC9250 10/25/40/50/100G Ethernet Controller (rev 01)
5e:00.1 Ethernet controller: Solarflare Communications XtremeScale SFC9250 10/25/40/50/100G Ethernet Controller (rev 01)

osresearch commented 2 months ago

Searching through the docs, it seems that this configuration is not supported: https://docs.amd.com/r/en-US/ug1586-onload-user/CTPIO-Modes

CTPIO feature can be used in three modes:

  • cut-through (ct) Lowest latency. A packet is transmitted onto the network as it is being streamed across the PCIe bus. The adapter starts transmitting the packet even before the entire packet has been delivered over the PCIe bus. Note: This mode is supported at 10 Gb, but not at 25 Gb.

Should the library detect if the user is attempting to use CTPIO on a 25G link?
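
As a rough illustration of what such a check could look like from user space (not how Onload or ef_vi does anything today), the negotiated link speed can be read from the Linux sysfs attribute `/sys/class/net/<ifname>/speed`, which reports Mb/s. The function names and the 10 Gb/s cutoff below are assumptions based only on the documentation note quoted above, and an unknown speed is treated conservatively as unsupported.

```python
from pathlib import Path

# Assumption from the quoted docs: cut-through is supported at 10 Gb, not 25 Gb.
CT_MAX_SPEED_MBPS = 10_000


def read_link_speed_mbps(ifname, sysfs_root="/sys/class/net"):
    """Return the link speed in Mb/s, or None if it cannot be determined."""
    try:
        text = (Path(sysfs_root) / ifname / "speed").read_text().strip()
        speed = int(text)
    except (OSError, ValueError):
        # Interface missing, link down, or attribute not readable.
        return None
    return speed if speed > 0 else None


def ct_mode_supported(speed_mbps):
    """True only when the speed is known and within the documented limit."""
    return speed_mbps is not None and speed_mbps <= CT_MAX_SPEED_MBPS


if __name__ == "__main__":
    speed = read_link_speed_mbps("ens2f0")
    if not ct_mode_supported(speed):
        print(f"warning: cut-through CTPIO may not be supported at {speed} Mb/s")
```

On the 25G direct-connect port this would print the warning, while the 10G switch port would pass the check.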