shirshen12 opened this issue 3 years ago
Hi Shirshendu,
Can you share dmesg to allow assessing whether there is any hint for the problem?
I am not familiar with xdpdump; does the "ip link" output confirm that no program is attached?
Would you be able to perform the following test for xdp support on hv_netvsc/c5?
Kind Regards, Maciej
Hi @maciejj-xilinx ,
The VMs with the Mellanox NiC do have XDP enabled programs running. I use the xdp-tutorial, packet rewrite program to verify XDP functionality.
Set up the XDP environment, clone the tutorial repo, and compile the program:
sudo yum install clang llvm kernel-headers bpftool
git clone --recurse-submodules https://github.com/xdp-project/xdp-tutorial.git
cd xdp-tutorial
cd packet02-rewriting/
make
Load the program and verify that it's running:
sudo ./xdp_loader -d enP28349s1
Success: Loaded BPF-object(xdp_prog_kern.o) and used section(xdp_port_rewrite)
- XDP prog attached on device:enP28349s1(ifindex:3)
- Pinning maps in /sys/fs/bpf/enP28349s1/
ip link show dev enP28349s1
3: enP28349s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 xdp qdisc mq master eth0 state UP mode DEFAULT group default qlen 1000
link/ether 00:0d:3a:7d:9c:68 brd ff:ff:ff:ff:ff:ff
prog/xdp id 26
xdpdump -D
Interface Prio Program name Mode ID Tag Chain actions
--------------------------------------------------------------------------------------
lo <No XDP program loaded!>
eth0 <No XDP program loaded!>
enP28349s1 xdp_port_rewrite_func native 26 3b185187f1855c4c
As can be seen above, the XDP program is loaded with ID 26 (tag 3b185187f1855c4c).
Now we load the xdp_stats program to see packet stats:
./xdp_stats -d enP28349s1
Collecting stats from BPF map
- BPF map (bpf_map_type:6) id:16 name:xdp_stats_map key_size:4 value_size:16 max_entries:5
XDP-action
XDP_ABORTED 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:0.250298
XDP_DROP 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:0.250233
XDP_PASS 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:0.250234
XDP_TX 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:0.250236
XDP_REDIRECT 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:0.250239
XDP-action
XDP_ABORTED 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.000304
XDP_DROP 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.000304
XDP_PASS 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.000304
XDP_TX 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.000304
XDP_REDIRECT 0 pkts ( 0 pps) 0 Kbytes ( 0 Mbits/s) period:2.000304
Thanks for all that. As XDP functionality is supported in the system, I'd expect Onload either to get on with installing the program or to leave a dmesg message indicating a problem. Could you share the dmesg output?
Hi @maciejj-xilinx
As noted in ticket #37, Onload does compile successfully.
Also, please observe that mlx5_core driver version 5.4.1 has AF_XDP ZC functionality built in.
Also please note that when the NiC is registered, the bpftool o/p is as follows
Before NiC is registered
[root@sriov-onload2 ~]# bpftool prog list
[root@sriov-onload2 ~]#
After NiC is registered:
[root@sriov-onload2 ~]# bpftool prog list
6: xdp name xdpsock tag 278a08739491e2c9 gpl
loaded_at 2021-07-23T11:25:44+0000 uid 0
xlated 208B jited 155B memlock 4096B map_ids 4
The dmesg o/p after NiC is registered:
[140649.416558] [onload] Onload <dev-snapshot>
[140649.416559] [onload] Copyright 2019-present Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
[140649.522209] onload_cp_server[192084]: Spawned daemon process 192106
[151049.816064] Using feature eBPF/rawtrace.
[151268.059408] [sfc efrm] efrm_nondl_register_device: register enP28143s1
[151268.059411] [sfc efrm] enP28143s1 type=4:
[151269.020115] [sfc efrm] enP28143s1 index=0 ifindex=3
[151269.020128] [onload] oo_nic_add: ifindex=3 oo_index=0
After onloading memcached:
[root@sriov-onload2 ~]# onload -p latency memcached -m 24576 -c 1024 -t 4 -u root -l 10.113.65.36:11211
oo:memcached[210058]: Using Onload <dev-snapshot> [1]
oo:memcached[210058]: Copyright 2019-present Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
dmesg corresponding to above command execution:
[151685.072355] [onload] [1]: WARNING: huge pages are incompatible with AF_XDP. Disabling hugepage support.
[151685.083504] Using feature AF_XDP.
onload stackdump o/p, indicating stacks are created:
[root@sriov-onload2 ~]# onload_stackdump
#stack-id stack-name pids
1 - -
After applying load on the onloaded memcached, perf top shows no Onload functions being called, i.e. no traffic is reaching Onload. Please see the screenshot below:
Do we need to enable a flag for the mux in Onload to route traffic to this stack ?
Onload definitely requires its XDP program to be installed to intercept the traffic. Since the XDP program installation does not take effect for some reason, the lack of acceleration is expected.
The results of your experiment show that there is XDP program support in the system. It puzzles me that Onload believes it succeeded in registering the device, which includes attaching the program to the net device, while the system state does not reflect this. The BPF program itself is apparently loaded.
Since this seems to be the only thing that is missing, I wonder if you could try manually attaching the loaded prog to your device with bpftool net attach.
Perhaps this would give us some clue.
Should this succeed, would it result in accelerated traffic? Should this fail, would it give some error message/code?
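For reference, a minimal sketch of the manual attach being suggested, reusing the program name and interface from the bpftool/dmesg outputs above; checking the exit code and the end of dmesg afterwards should surface any error:
bpftool net attach xdp name xdpsock dev enP28143s1
echo $?          # a non-zero exit code would indicate the attach failed
dmesg | tail     # look for any driver complaint following the attach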
Hello @maciejj-xilinx
Please see the following steps I took to get xdpsock detected by the driver as a loaded XDP program.
TL;DR: xdpsock is now detected properly, but traffic is still being processed by the kernel stack and not by Onload.
Before manually attaching: (xdpsock program not listed)
[root@sriov-onload2 ~]#bpftool net list dev enP28143s1
xdp:
tc:
flow_dissector:
Manually attaching:
[root@sriov-onload2 ~]# bpftool net attach xdp name xdpsock dev enP28143s1
[root@sriov-onload2 ~]#
Verify if program is being detected:
[root@sriov-onload2 ~]# bpftool net list dev enP28143s1
xdp:
enP28143s1(3) driver id 6
tc:
flow_dissector:
Verify using xdpdump:
[root@sriov-onload2 ~]# xdpdump -D
Interface Prio Program name Mode ID Tag Chain actions
--------------------------------------------------------------------------------------
lo <No XDP program loaded!>
eth0 <No XDP program loaded!>
enP28143s1 xdpsock native 6 278a08739491e2c9
dmesg o/p corresponding to above action:
[411474.900787] Using feature eBPF/xdp.
[411591.274512] [onload] [4]: WARNING: huge pages are incompatible with AF_XDP. Disabling hugepage support.
Verify onload stack being created:
[root@sriov-onload2 ~]# onload_stackdump
#stack-id stack-name pids
4 - -
Now, offload memcached:
[root@sriov-onload2 ~]# onload -p latency memcached -m 24576 -c 1024 -t 4 -u root -l 10.113.65.36:11211
oo:memcached[561541]: Using Onload <dev-snapshot> [4]
oo:memcached[561541]: Copyright 2019-present Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
Now, when we generate traffic for the onloaded memcached, there are again no u_polls, only k_polls, and hence no traffic is being processed by the Onload stack; it is bounced off to the kernel.
[root@sriov-onload2 ~]# onload_stackdump stats | grep polls
k_polls: 318603
u_polls: 0
ioctl_evq_polls: 0
periodic_polls: 5533
interrupt_polls: 313069
deferred_polls: 0
timeout_interrupt_polls: 0
Verified via perf top that Onload is not in the processing path at all:
More telling would be the reported number of events: onload_stackdump stats | grep evs
But I suppose this also shows 0 - that would mean no traffic is hitting the AF_XDP socket.
Ok, so rx_evs stats are not zero:
[root@sriov-onload2 ~]# onload_stackdump stats | grep evs
rx_evs: 55990
tx_evs: 0
periodic_evs: 0
interrupt_evs: 0
This is perplexing. The u_polls are still zero.
[root@sriov-onload2 ~]# onload_stackdump stats | grep polls
k_polls: 1363836
u_polls: 0
ioctl_evq_polls: 0
periodic_polls: 7304
interrupt_polls: 1356531
deferred_polls: 0
timeout_interrupt_polls: 0
I have a lead: it's an issue with queue binding as seen by the Mellanox mlx5_core driver. I will give a detailed report shortly.
So, @maciejj-xilinx, this is the issue; please do take the time to read this entire thread, it's very informative: https://www.spinics.net/lists/xdp-newbies/msg01252.html
Onload makes available EF_AF_XDP_ZEROCOPY=0, which allows disabling zerocopy, and EF_IRQ_CHANNEL=<channel_no>, which binds the stack to a specific channel.
By default Onload tries to enable zerocopy and uses a channel selected based on CPU affinity.
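For example, a sketch of how these would be used (they are environment variables read by Onload when the stack is created; the memcached command line is the one from earlier in this thread, and the channel number is only a placeholder):
EF_AF_XDP_ZEROCOPY=0 onload -p latency memcached -m 24576 -c 1024 -t 4 -u root -l 10.113.65.36:11211
EF_IRQ_CHANNEL=2 onload -p latency memcached -m 24576 -c 1024 -t 4 -u root -l 10.113.65.36:11211    # 2 is just an example channel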
So, @maciejj-xilinx, I have been perf tracing, and it does look like the xdpsock program inserted by Onload gets exercised, but most probably the packets get bounced off to the kernel stack because of the hardware queue numbering issue mentioned above.
So, if there is an easy fix to make the packets pass via Onload by making some command-line changes, then we are good; otherwise a fix to accommodate this weird hardware queue issue from Mellanox for AF_XDP ZC is required in Onload.
Can you please validate or refute this?
I was hoping that either EF_AF_XDP_ZEROCOPY=0 (to disable zerocopy and use the standard queue number) or, alternatively, EF_IRQ_CHANNEL=<high number> (to make it work for zerocopy) would make it work.
Just to double check: the traffic is just standard TCP/IP - no encapsulation, right?
Can you check for presence of relevant hardware filters?
Hi @maciejj-xilinx
I see no hardware ntuple filters at all when I do:
[root@sriov-onload2 ~]# ethtool -n enP28143s1
Also traffic is standard TCP/IP and no encapsulation.
We'd expect a HW filter there - well, at least with SFC and Intel NICs. Have you got any indication of a filter insertion error in dmesg?
If the filter is not there, I'd suggest adding a HW filter manually. For memcached the case is simple: a 3-tuple filter for the listen socket.
ethtool --config-ntuple enP28143s1 flow-type tcp4 dst-ip 10.113.65.36 dst-port 11211 action 4
The action is the queue number. I'd suppose this needs to match the queue number Onload is using, and a good guess would be the high number used for zerocopy.
On many NICs, ethtool stats show traffic per queue; this is one way to check whether the traffic goes to the right queue.
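One way to do that check (a sketch; per-queue stat names vary by driver - on mlx5 they look like the rx<N>_* counters shown elsewhere in this thread):
ethtool -S enP28143s1 | grep -E 'rx[0-9]+_packets'   # run before and after generating traffic and compare the deltas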
Hopefully we will find the whole collection of issues soon.
So, I have not added the HW filter rule yet, but on further digging into the XDP_REDIRECT stats on the Mellanox NIC, I find a massive number of XDP redirects.
[root@sriov-onload2 ~]# ethtool -S enP28143s1 | grep redirect
rx_xdp_redirect: 7175774
rx_xsk_xdp_redirect: 0
rx0_xdp_redirect: 7175777
rx1_xdp_redirect: 0
rx2_xdp_redirect: 0
rx3_xdp_redirect: 0
rx4_xdp_redirect: 0
rx5_xdp_redirect: 0
rx6_xdp_redirect: 0
rx7_xdp_redirect: 0
So this means Onload's inserted XDP program does intercept the traffic and issues XDP_REDIRECT, yet rx_xsk_xdp_redirect stays at 0, so the redirected packets never reach the XSK socket and end up back in the kernel stack.
Let me enable the HW filter and check now.
So, no luck @maciejj-xilinx. I added the hardware filter and still see only XDP redirects. I think it needs a deeper fix, as discussed in the mailing list link I posted earlier.
A bit puzzling is that neither the XDP program nor the filter gets attached to the interface.
Does some operation fail that the Onload code somehow misses? Hard to say; it would be good to print the result of each call in af_xdp_init.
Let's say the above can be worked around with manual operations as suggested earlier and
we are left with the queue number issue.
From reading the ticket, we should use EF_IRQ_CHANNEL=8 (and the same number for the ethtool action).
Onload's XDP program uses a BPF map to redirect traffic from the bound queue to its XSK socket. However, Onload does not expect the queue number to be bigger than what it read from the NIC info.
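For context, the queue-indexed XSKMAP redirect pattern looks roughly like this (a generic sketch along the lines of the kernel's xdpsock sample, not Onload's actual program; the map name, size, and function name are illustrative):
/* xsk_redirect_sketch.c - illustrative only */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);   /* must cover the highest queue index in use */
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xsk_redirect(struct xdp_md *ctx)
{
    __u32 qid = ctx->rx_queue_index;

    /* If an AF_XDP socket is registered for this RX queue, redirect to it;
     * otherwise let the packet continue to the kernel stack. */
    if (bpf_map_lookup_elem(&xsks_map, &qid))
        return bpf_redirect_map(&xsks_map, qid, 0);
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
If the driver reports a queue index the map was not sized (or populated) for, the lookup fails and the packet falls through to the kernel - which is consistent with the symptoms above.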
It looks like the code in the diff below should help. It would allow queue number 8, and would protect us in case the XDP program sees a misreported queue number (e.g. 0) - I am not certain what to expect.
diff --git a/src/lib/efhw/af_xdp.c b/src/lib/efhw/af_xdp.c
index f3a0a5b56a..b6149eb04c 100644
--- a/src/lib/efhw/af_xdp.c
+++ b/src/lib/efhw/af_xdp.c
@@ -794,6 +794,10 @@ static int af_xdp_init(struct efhw_nic* nic, int instance,
if( rc < 0 )
goto out_free_user_offsets;
+ xdp_map_update(nic->af_xdp, instance % nic->vi_lim, file);
+ xdp_map_update(nic->af_xdp, instance + nic->vi_lim, file);
+
+
/* TODO AF_XDP: currently instance number matches net_device channel */
rc = xdp_bind(sock, nic->net_dev->ifindex, instance, vi->flags);
if( rc == -EBUSY ) {
@@ -928,7 +932,7 @@ __af_xdp_nic_init_hardware(struct efhw_nic *nic,
xdp->vi = (struct efhw_af_xdp_vi*) (xdp + 1);
xdp->pd = (struct protection_domain*) (xdp->vi + nic->vi_lim);
- rc = map_fd = xdp_map_create(sys_call_area, nic->vi_lim);
+ rc = map_fd = xdp_map_create(sys_call_area, nic->vi_lim * 2);
if( rc < 0 )
goto fail_map;
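If it helps, a rough way to try that fragment (a sketch: the patch filename is hypothetical, the rebuild/reload steps per DEVELOPING.md are omitted, and queue/channel 8 follows the note above - adjust to your setup):
git apply af_xdp-queue-map.patch    # the fragment above, saved to a file in the onload source tree
# rebuild and reload the Onload drivers per DEVELOPING.md, then:
ethtool --config-ntuple enP28143s1 flow-type tcp4 dst-ip 10.113.65.36 dst-port 11211 action 8
EF_IRQ_CHANNEL=8 onload -p latency memcached -m 24576 -c 1024 -t 4 -u root -l 10.113.65.36:11211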
@maciejj-xilinx so, let me generate a patch from the above fragment you posted and test again.
So, @maciejj-xilinx, no luck. The hardware filter we apply in the NIC simply blackholes the traffic. I think it will be a better idea to debug this in an Azure VM from your side.
The problem then seems less obvious. If I were to progress on my side, I'd need access to at least the NIC model. In some respects it might be preferable to actually try this on bare metal with that NIC model first. It is fairly likely to have more to do with the mlx5 drivers than with the Azure VM.
Hello @shirshen12 ,
I tested the latest Onload (latest commit on the master branch, c9ccdcdfcd9ce12930c364011ebec1c64e49fb88) on a Mellanox CX4-LX. It works. Here is my setup.
Host OS: Ubuntu 20.04.3 LTS
Host OS Kernel: 5.4.0-88-generic #99
Guest OS: Ubuntu 20.04.3 LTS
Guest OS Kernel: 5.4.0-88-generic #99
CPU: AMD EPYC 7713 64-Core Processor
vCPUs for Guest: 32
Memory for Guest: 64GB
Host GRUB kernel parameters (/proc/cmdline): BOOT_IMAGE=/boot/vmlinuz-5.4.0-88-generic root=UUID=3e63ff61-49f4-40ef-a0b2-9b991fc03a9c ro maybe-ubiquity iommu=pt amd_iommu=on transparent_hugepage=never hpet=disable tsc=reliable selinux=0 processor.max_cstate=0
Guest GRUB kernel parameters (/proc/cmdline): BOOT_IMAGE=/boot/vmlinuz-5.4.0-88-generic root=UUID=6b18114a-e2da-4d1f-b897-39b55cca2031 ro console=tty1 console=ttyS0
Mellanox Card driver and firmware on host OS
[] % ethtool -i cx4eth0
driver: mlx5_core
version: 5.4-1.0.3
firmware-version: 14.23.1020 (MT_2470111034)
expansion-rom-version:
bus-info: 0000:21:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
[] % lspci | grep Mellanox
21:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
21:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
21:00.2 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
21:00.3 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
Mellanox NIC card info in VM
[ubuntu@vm01:~/workspace/code/onload] master ± ethtool -i enp5s0
driver: mlx5_core
version: 5.0-0
firmware-version: 14.23.1020 (MT_2470111034)
expansion-rom-version:
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
Parameters for VM creation (21:00.2 is the VF passed through to the VM):
virt-install --connect qemu:///system --virt-type kvm --name vm01 --ram 65536 --vcpus=32 --os-type linux --os-variant ubuntu20.04 --disk path=vm01.qcow2,device=disk --disk path=vm01-seed.qcow2,device=disk --import --network network=default,model=virtio,mac=$MAC_ADDR --noautoconsole --host-device=pci_0000_21_00_2
Then follow DEVELOPING.md to build Onload and register the Mellanox NIC VF for AF_XDP.
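As far as I recall, the registration step looks roughly like this (an assumption based on the repo's AF_XDP notes; the exact sysfs path may differ between Onload versions):
echo enp5s0 > /sys/module/sfc_resource/afxdp/register    # register the VF interface for AF_XDP use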
Tested Onload with netperf. (Please ignore the performance numbers: this Mellanox NIC is the management interface of my server, which is connected to a 1G switch. :-( )
vm01# ./onload -p latency netperf -H 10.23.81.3 -t TCP_STREAM
oo:netperf[8819]: Using Onload <dev-snapshot> [7]
oo:netperf[8819]: Copyright 2019-present Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.23.81.3 () port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
131072 16384 16384 10.00 935.85
In another terminal, I can see u_polls increased.
vm01# ./onload_stackdump stats | grep polls
k_polls: 2
u_polls: 127759
ioctl_evq_polls: 0
periodic_polls: 2
interrupt_polls: 0
deferred_polls: 0
timeout_interrupt_polls: 0
Hope it helps.
Regards, Zhiyi
@zhiyisun This is very helpful; it might mean something is off on the host side in Azure. As you can see from the prior comments, I am trying on CentOS 8 with a 4.18+ kernel. Also, Onload-on-AF_XDP works with Intel NICs on CentOS 4.18+, but yes, let me try on Ubuntu 20.04 LTS (5.4+) and reproduce your steps.
So @zhiyisun, I have been able to validate the instructions on vultr.com, which provides a single-port 25GbE Mellanox ConnectX-5 NIC, on Ubuntu 20.04 LTS.
But those instructions don't work on Ubuntu 20.04 LTS in an Azure Ds_v4-type VM (with SR-IOV passthrough access).
It would be awesome if we could get on a call and debug why it isn't working - are you OK with that?
Hello @shirshen12 , actually, I am able to replicate the issue on Azure. It's the same as you described above. We need some time to debug this issue. Will let you know when we get some clue.
@shirshen12 Regarding HW filter insertion in the netvsc environment: in your comment #issuecomment-893672203 you indicated that you attempted to insert a HW filter. I understood it succeeded, as you did not report any error with this command. However, the xdp-redirect stats in the subsequent comment did not show the filter taking effect. So, to clarify:
@maciejj-xilinx yes. I was able to insert the filter, and it did show up as well, but the traffic then got blackholed; when I don't insert the rule, the traffic bounces to the kernel transparently. Also, in this case the traffic always showed up at queue 0, but per the link above there is some weird issue in the numbering. The question is: what should that hardware queue number be? With Intel NICs we see those hardware filter rules added transparently.
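To pin down which queue numbers actually exist on the interface, something like this might help (a sketch; the channel counts and sysfs queues do not always line up with what the driver reports to XDP):
ethtool -l enP28143s1                    # configured/maximum channel counts
ls /sys/class/net/enP28143s1/queues/     # rx-N/tx-N queues the kernel exposes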
Also @maciejj-xilinx and @zhiyisun, to test your work you will need to run a VM with:
and upgrade the driver to Mellanox OFED 5.4+, since AF_XDP ZC support is enabled from this version onwards.
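A quick way to confirm which driver build is actually loaded (interface name as used earlier in this thread; note the in-tree mlx5_core may not report an OFED-style version string):
ethtool -i enP28143s1 | grep -E '^(driver|version|firmware-version)'
modinfo mlx5_core | grep -i '^version'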
@maciejj-xilinx / @zhiyisun can we check whether Onload can fall back to XDP generic mode if certain features are absent in DRV (native) mode - say, the XDP_REDIRECT support in the hv_netvsc driver?
@maciejj-xilinx / @zhiyisun MSFT Azure is currently implementing the XDP_REDIRECT clause in the Hyper-V PCI driver. I think it will start working after that, because of the dummy driver issue.
The patch has been implemented: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=1cb9d3b6185b
I will test, report back, and close this issue if it works.
The patch is now in the mainline Linux kernel:
and in Ubuntu as well: https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/jammy/log/drivers/net/hyperv
Hello Onload Team,
I was able to get Onload to work on Mellanox NiC (SR-IOV mode) in Azure. Please see: https://github.com/Xilinx-CNS/onload/issues/37 for details.
While the app is now being offloaded, we see no traffic being processed by the stack. Please see the following command update:
As can be seen, u_polls are ZERO and k_polls are incrementing, meaning the user-level stack's poll-mode path is not being exercised.
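For reference, the counters being described are the onload_stackdump poll statistics, i.e. the same check used elsewhere in this thread:
onload_stackdump stats | grep polls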
Also, when I run the xdpdump -D command, I see no AF_XDP program loaded into the eBPF VM:
I have already registered the interface and enabled hugepages:
Help is appreciated.