Xilinx-CNS / onload

OpenOnload high performance user-level network stack

Application offloaded to Onload via AF_XDP on a Mellanox NIC in Azure does not process traffic and bounces it to the kernel stack #39

Open shirshen12 opened 3 years ago

shirshen12 commented 3 years ago

Hello Onload Team,

I was able to get Onload to work on a Mellanox NIC (SR-IOV mode) in Azure. Please see https://github.com/Xilinx-CNS/onload/issues/37 for details.

While the app is now being offloaded, we see no traffic being processed by the stack. Please see the following command output:

[root@sriov-onload1 ~]# onload_stackdump 
#stack-id stack-name      pids
6         -               -

[root@sriov-onload1 ~]# onload_stackdump stats | grep polls
k_polls: 3789944
u_polls: 0
ioctl_evq_polls: 0
periodic_polls: 2695
interrupt_polls: 3787248
deferred_polls: 0
timeout_interrupt_polls: 0

As can be seen, u_polls is zero while k_polls keeps incrementing, meaning the user-level poll-mode path is not being exercised.

Also, when we run the xdpdump -D command, I see no XDP program loaded:

[root@sriov-onload1 ~]# xdpdump -D
Interface        Prio  Program name      Mode     ID   Tag               Chain actions
--------------------------------------------------------------------------------------
lo                     <No XDP program loaded!>
eth0                   <No XDP program loaded!>
enP28349s1             <No XDP program loaded!>

I have already registered the interface and enabled hugepages:

[root@sriov-onload1 ~]# ulimit -l unlimited
[root@sriov-onload1 ~]# echo 800 > /proc/sys/vm/nr_hugepages
[root@sriov-onload1 ~]# echo enP28349s1 > /sys/module/sfc_resource/afxdp/register

Help is appreciated.

maciejj-xilinx commented 3 years ago

Hi Shirshendu,

Can you share the dmesg output so we can check whether there is any hint about the problem?

I am not familiar with xdpdump; does the "ip link" output confirm that no program is attached?

Would you be able to perform the following test for xdp support on hv_netvsc/c5?

Kind Regards, Maciej

shirshen12 commented 3 years ago

Hi @maciejj-xilinx ,

The VMs with the Mellanox NIC do support XDP programs. I used the packet-rewrite program from the xdp-tutorial to verify XDP functionality.

Set up the XDP environment, clone the tutorials repo, and compile the program:

sudo yum install clang llvm kernel-headers bpftool
git clone --recurse-submodules https://github.com/xdp-project/xdp-tutorial.git
cd xdp-tutorial
cd packet02-rewriting/
make

Load the program and verify that it is running:

sudo ./xdp_loader -d enP28349s1
Success: Loaded BPF-object(xdp_prog_kern.o) and used section(xdp_port_rewrite)
 - XDP prog attached on device:enP28349s1(ifindex:3)
 - Pinning maps in /sys/fs/bpf/enP28349s1/

ip link show dev enP28349s1
3: enP28349s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 xdp qdisc mq master eth0 state UP mode DEFAULT group default qlen 1000
    link/ether 00:0d:3a:7d:9c:68 brd ff:ff:ff:ff:ff:ff
    prog/xdp id 26 

xdpdump -D
Interface        Prio  Program name      Mode     ID   Tag               Chain actions
--------------------------------------------------------------------------------------
lo                     <No XDP program loaded!>
eth0                   <No XDP program loaded!>
enP28349s1             xdp_port_rewrite_func native   26   3b185187f1855c4c 

As can be seen above, the XDP program is loaded with ID 26.

Now we load the xdp_stats program to see packet statistics:

./xdp_stats -d enP28349s1
Collecting stats from BPF map
 - BPF map (bpf_map_type:6) id:16 name:xdp_stats_map key_size:4 value_size:16 max_entries:5
XDP-action  
XDP_ABORTED            0 pkts (         0 pps)           0 Kbytes (     0 Mbits/s) period:0.250298
XDP_DROP               0 pkts (         0 pps)           0 Kbytes (     0 Mbits/s) period:0.250233
XDP_PASS               0 pkts (         0 pps)           0 Kbytes (     0 Mbits/s) period:0.250234
XDP_TX                 0 pkts (         0 pps)           0 Kbytes (     0 Mbits/s) period:0.250236
XDP_REDIRECT           0 pkts (         0 pps)           0 Kbytes (     0 Mbits/s) period:0.250239

XDP-action  
XDP_ABORTED            0 pkts (         0 pps)           0 Kbytes (     0 Mbits/s) period:2.000304
XDP_DROP               0 pkts (         0 pps)           0 Kbytes (     0 Mbits/s) period:2.000304
XDP_PASS               0 pkts (         0 pps)           0 Kbytes (     0 Mbits/s) period:2.000304
XDP_TX                 0 pkts (         0 pps)           0 Kbytes (     0 Mbits/s) period:2.000304
XDP_REDIRECT           0 pkts (         0 pps)           0 Kbytes (     0 Mbits/s) period:2.000304
maciejj-xilinx commented 3 years ago

Thanks for all that. Since XDP functionality is supported on the system, I'd expect Onload either to get on with installing its program or to log a dmesg message indicating a problem. Could you share the dmesg output?

shirshen12 commented 3 years ago

Hi @maciejj-xilinx

As noted in the ticket I filed, #37, Onload does compile successfully.

Also, please note that mlx5_core driver version 5.4.1 has AF_XDP ZC (zero-copy) functionality built in.

[Screenshot attached: 2021-07-23, 4:45 PM]

Also please note that when the NIC is registered, the bpftool output is as follows.

Before the NIC is registered:

[root@sriov-onload2 ~]# bpftool prog list
[root@sriov-onload2 ~]#

After the NIC is registered:

[root@sriov-onload2 ~]# bpftool prog list
6: xdp  name xdpsock  tag 278a08739491e2c9  gpl
    loaded_at 2021-07-23T11:25:44+0000  uid 0
    xlated 208B  jited 155B  memlock 4096B  map_ids 4

The dmesg output after the NIC is registered:

[140649.416558] [onload] Onload <dev-snapshot>
[140649.416559] [onload] Copyright 2019-present Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
[140649.522209] onload_cp_server[192084]: Spawned daemon process 192106
[151049.816064] Using feature eBPF/rawtrace.
[151268.059408] [sfc efrm] efrm_nondl_register_device: register enP28143s1
[151268.059411] [sfc efrm] enP28143s1 type=4:
[151269.020115] [sfc efrm] enP28143s1 index=0 ifindex=3
[151269.020128] [onload] oo_nic_add: ifindex=3 oo_index=0

After onloading memcached:

[root@sriov-onload2 ~]# onload -p latency memcached -m 24576 -c 1024 -t 4 -u root -l 10.113.65.36:11211
oo:memcached[210058]: Using Onload <dev-snapshot> [1]
oo:memcached[210058]: Copyright 2019-present Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks

dmesg output corresponding to the above command:

[151685.072355] [onload] [1]: WARNING: huge pages are incompatible with AF_XDP. Disabling hugepage support.
[151685.083504] Using feature AF_XDP.

onload_stackdump output, indicating a stack has been created:

[root@sriov-onload2 ~]# onload_stackdump 
#stack-id stack-name      pids
1         -               -

After applying load to the onloaded memcached, perf top shows no Onload functions being called. Please see the screenshot below:

[Screenshot attached: 2021-07-23, 5:35 PM]

Do we need to enable a flag for the mux in Onload to route traffic to this stack?

maciejj-xilinx commented 3 years ago

Onload definitely requires its XDP program to be installed in order to intercept the traffic. Since the XDP program installation is not taking effect for some reason, the lack of acceleration is expected.

The results of your experiment show that XDP program support is present in the system. It puzzles me that Onload believes it succeeded in registering the device, which includes attaching the program to the net device, while the system state does not reflect this. The BPF program itself is apparently loaded.

Since this seems to be the only thing that is missing, I wonder if you could try manually attaching the loaded program to your device with bpftool net attach. Perhaps that would give us some clue. If it succeeds, does it result in accelerated traffic? If it fails, does it give some error message/code?

shirshen12 commented 3 years ago

Hello @maciejj-xilinx

Please see the following steps I took to get xdpsock detected by the driver as a loaded XDP program.

TL;DR: xdpsock is now detected properly, but traffic is still being processed by the kernel stack and not by Onload.

Before manually attaching (xdpsock program not listed):

[root@sriov-onload2 ~]#bpftool net list dev enP28143s1
xdp:

tc:

flow_dissector:

Manually attaching:

[root@sriov-onload2 ~]# bpftool net attach xdp name xdpsock dev enP28143s1 
[root@sriov-onload2 ~]#

Verify that the program is detected:

[root@sriov-onload2 ~]# bpftool net list dev enP28143s1
xdp:
enP28143s1(3) driver id 6

tc:

flow_dissector:

Verify using xdpdump:

[root@sriov-onload2 ~]# xdpdump -D
Interface        Prio  Program name      Mode     ID   Tag               Chain actions
--------------------------------------------------------------------------------------
lo                     <No XDP program loaded!>
eth0                   <No XDP program loaded!>
enP28143s1             xdpsock           native   6    278a08739491e2c9 

dmesg output corresponding to the above action:

[411474.900787] Using feature eBPF/xdp.
[411591.274512] [onload] [4]: WARNING: huge pages are incompatible with AF_XDP. Disabling hugepage support.

Verify that an Onload stack is created:

[root@sriov-onload2 ~]# onload_stackdump 
#stack-id stack-name      pids
4         -               -

Now, offload memcached:

[root@sriov-onload2 ~]# onload -p latency memcached -m 24576 -c 1024 -t 4 -u root -l 10.113.65.36:11211
oo:memcached[561541]: Using Onload <dev-snapshot> [4]
oo:memcached[561541]: Copyright 2019-present Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks

Now, when we generate traffic for the offloaded memcached, we again see no u_polls, only k_polls, so no traffic is being processed by the Onload stack and it is bounced to the kernel.

[root@sriov-onload2 ~]# onload_stackdump stats | grep polls
k_polls: 318603
u_polls: 0
ioctl_evq_polls: 0
periodic_polls: 5533
interrupt_polls: 313069
deferred_polls: 0
timeout_interrupt_polls: 0

Verified via perf top that Onload is not in the processing path at all:

[Screenshot attached: 2021-07-26, 5:34 PM]
maciejj-xilinx commented 3 years ago

More telling would be the reported number of events: onload_stackdump stats | grep evs. But I suppose this also shows 0, which would mean no traffic is hitting the AF_XDP socket.

shirshen12 commented 3 years ago

OK, so the rx_evs stat is not zero:

[root@sriov-onload2 ~]# onload_stackdump stats | grep evs
rx_evs: 55990
tx_evs: 0
periodic_evs: 0
interrupt_evs: 0

This is perplexing. The u_polls are still zero.

[root@sriov-onload2 ~]# onload_stackdump stats | grep polls
k_polls: 1363836
u_polls: 0
ioctl_evq_polls: 0
periodic_polls: 7304
interrupt_polls: 1356531
deferred_polls: 0
timeout_interrupt_polls: 0
shirshen12 commented 3 years ago

I have a lead: it's an issue with queue binding as seen by the Mellanox mlx5_core driver. I will give a detailed report shortly.

shirshen12 commented 3 years ago

So @maciejj-xilinx, this is the issue; please take the time to read this entire thread, it's very informative: https://www.spinics.net/lists/xdp-newbies/msg01252.html

maciejj-xilinx commented 3 years ago

Onload provides EF_AF_XDP_ZEROCOPY=0 to disable zero-copy, and EF_IRQ_CHANNEL=<channel_no> to bind to a specific channel.

By default Onload tries to enable zero-copy and uses a channel selected based on CPU affinity.
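
For example, a hypothetical invocation based on the memcached command used earlier in this thread (the channel number here is only an illustration and needs to match your queue/filter setup):

EF_AF_XDP_ZEROCOPY=0 EF_IRQ_CHANNEL=0 onload -p latency memcached -m 24576 -c 1024 -t 4 -u root -l 10.113.65.36:11211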

shirshen12 commented 3 years ago

So @maciejj-xilinx, I have been perf tracing, and it does look like the xdpsock program inserted by Onload gets exercised, but most probably the packets get bounced to the kernel stack because of the hardware queue numbering issue mentioned above; see below:

[Screenshot attached: 2021-07-28, 7:48 PM]

So, if there is an easy fix to make the packets pass via Onload through some command-line changes, then we are good; otherwise Onload needs a fix to accommodate this odd hardware queue numbering for AF_XDP ZC on Mellanox.

Can you please validate or refute this?

maciejj-xilinx commented 3 years ago

I was hoping that either:

Just to double check: the traffic is just standard tcp/ip - no encapsulation, right?

Can you check for presence of relevant hardware filters?

shirshen12 commented 3 years ago

Hi @maciejj-xilinx

I see no hardware ntuple filters at all when I do: [root@sriov-onload2 ~]# ethtool -n enP28143s1

Also traffic is standard TCP/IP and no encapsulation.

maciejj-xilinx commented 3 years ago

We'd expect a HW filter there, at least with SFC and Intel NICs. Have you got any indication of a filter-insertion error in dmesg?

If the filter is not there, I'd suggest adding a HW filter manually. For memcached the case is simple: a 3-tuple filter for the listen socket.

ethtool --config-ntuple enP28143s1 flow-type tcp4 dst-ip 10.113.65.36 dst-port 11211 action 4

The action is the queue number. I'd suppose this needs to match the queue number Onload is using, and a good guess would be the high queue number used for zero-copy.

On many NICs the ethtool stats show traffic per queue; this is one way to check whether the traffic goes to the right queue.

Hopefully we will find the whole collection of issues soon.
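
For example, a rough per-queue traffic check, assuming mlx5-style per-queue counter names (they differ between drivers):

ethtool -S enP28143s1 | grep -E 'rx[0-9]+_packets'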

shirshen12 commented 3 years ago

So, I have not yet added the HW filter rule, but on digging further into the XDP_REDIRECT stats on the Mellanox NIC, I find a massive number of XDP redirects.

[root@sriov-onload2 ~]# ethtool -S enP28143s1 | grep redirect
     rx_xdp_redirect: 7175774
     rx_xsk_xdp_redirect: 0
     rx0_xdp_redirect: 7175777
     rx1_xdp_redirect: 0
     rx2_xdp_redirect: 0
     rx3_xdp_redirect: 0
     rx4_xdp_redirect: 0
     rx5_xdp_redirect: 0
     rx6_xdp_redirect: 0
     rx7_xdp_redirect: 0

So this means that the XDP program inserted by Onload does intercept the traffic, but the traffic then ends up being redirected back towards the kernel stack.

Let me enable the hw filter and check now.

shirshen12 commented 3 years ago

So, no luck @maciejj-xilinx. I added the hardware filter and still see only XDP redirects. I think it needs a deeper fix, as mentioned in the mailing-list link I posted earlier.

maciejj-xilinx commented 3 years ago

A bit puzzling is that neither the XDP program nor the filter gets attached to the interface. Does some operation fail and the Onload code somehow miss it? Hard to say; it would be good to print the result of each call in af_xdp_init. Let's say the above can be worked around with the manual operations suggested earlier, and we are left with the queue number issue.

From reading the ticket, we should use EF_IRQ_CHANNEL=8 (and the same number for the ethtool action).

Onload's XDP program uses a BPF map to redirect traffic from the bound queue to its XSK socket. However, Onload does not expect the queue number to be bigger than what it read from the NIC info.

It looks like this code should help. It would allow queue number 8, and it would protect us in case the XDP program sees a misreported queue number (e.g. 0) - I am not certain what to expect.

diff --git a/src/lib/efhw/af_xdp.c b/src/lib/efhw/af_xdp.c
index f3a0a5b56a..b6149eb04c 100644
--- a/src/lib/efhw/af_xdp.c
+++ b/src/lib/efhw/af_xdp.c
@@ -794,6 +794,10 @@ static int af_xdp_init(struct efhw_nic* nic, int instance,
   if( rc < 0 )
     goto out_free_user_offsets;

+  xdp_map_update(nic->af_xdp, instance % nic->vi_lim, file);
+  xdp_map_update(nic->af_xdp, instance + nic->vi_lim, file);
+
+
   /* TODO AF_XDP: currently instance number matches net_device channel */
   rc = xdp_bind(sock, nic->net_dev->ifindex, instance, vi->flags);
   if( rc == -EBUSY ) {
@@ -928,7 +932,7 @@ __af_xdp_nic_init_hardware(struct efhw_nic *nic,
        xdp->vi = (struct efhw_af_xdp_vi*) (xdp + 1);
        xdp->pd = (struct protection_domain*) (xdp->vi + nic->vi_lim);

-       rc = map_fd = xdp_map_create(sys_call_area, nic->vi_lim);
+       rc = map_fd = xdp_map_create(sys_call_area, nic->vi_lim * 2);
        if( rc < 0 )
                goto fail_map;
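
For reference, here is a minimal generic sketch (not Onload's actual program, and assuming libbpf-style map definitions) of the XSKMAP redirect pattern described above: the XDP program steers packets from its receive queue to whatever AF_XDP socket is registered at that queue's index in the map, and falls back to the kernel stack otherwise.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* One slot per receive queue; user space inserts its AF_XDP socket fd
 * at the index of the queue it has bound to. */
struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xsk_redirect(struct xdp_md *ctx)
{
    __u32 qid = ctx->rx_queue_index;

    /* Redirect to the AF_XDP socket bound to this queue, if any;
     * otherwise let the packet continue to the kernel stack. */
    if (bpf_map_lookup_elem(&xsks_map, &qid))
        return bpf_redirect_map(&xsks_map, qid, 0);

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
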
shirshen12 commented 3 years ago

@maciejj-xilinx so, let me generate a patch from the fragment you posted above and test again.

shirshen12 commented 3 years ago

So @maciejj-xilinx, no luck. The hardware filter that we apply on the NIC simply blackholes the traffic. I think it would be better to debug this in an Azure VM from your side.

maciejj-xilinx commented 3 years ago

The problem then seems less obvious. If I were to progress on my part, I'd need access to at least the NIC model. In some respects it might be preferable to try this on bare metal with that NIC model first. It is fairly likely to have more to do with the mlx5 drivers than with the Azure VM.

zhiyisun commented 2 years ago

Hello @shirshen12 ,

I tested the latest Onload (latest commit on the master branch: c9ccdcdfcd9ce12930c364011ebec1c64e49fb88) on a Mellanox ConnectX-4 Lx. It works. Here is my setup.

Host OS: Ubuntu 20.04.3 LTS
Host OS kernel: 5.4.0-88-generic #99
Guest OS: Ubuntu 20.04.3 LTS
Guest OS kernel: 5.4.0-88-generic #99
CPU: AMD EPYC 7713 64-Core Processor
vCPUs for guest: 32
Memory for guest: 64GB
Host GRUB kernel parameters (/proc/cmdline): BOOT_IMAGE=/boot/vmlinuz-5.4.0-88-generic root=UUID=3e63ff61-49f4-40ef-a0b2-9b991fc03a9c ro maybe-ubiquity iommu=pt amd_iommu=on transparent_hugepage=never hpet=disable tsc=reliable selinux=0 processor.max_cstate=0
Guest GRUB kernel parameters (/proc/cmdline): BOOT_IMAGE=/boot/vmlinuz-5.4.0-88-generic root=UUID=6b18114a-e2da-4d1f-b897-39b55cca2031 ro console=tty1 console=ttyS0

Mellanox Card driver and firmware on host OS

[] % ethtool -i cx4eth0
driver: mlx5_core
version: 5.4-1.0.3
firmware-version: 14.23.1020 (MT_2470111034)
expansion-rom-version: 
bus-info: 0000:21:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

[] % lspci | grep Mellanox
21:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
21:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
21:00.2 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
21:00.3 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]

Mellanox NIC card info in VM

[ubuntu@vm01:~/workspace/code/onload] master ± ethtool -i enp5s0
driver: mlx5_core
version: 5.0-0
firmware-version: 14.23.1020 (MT_2470111034)
expansion-rom-version: 
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

Parameters for VM creation (21:00.2 is the VF passed through to the VM):

virt-install --connect qemu:///system --virt-type kvm --name vm01 --ram 65536 --vcpus=32 --os-type linux --os-variant ubuntu20.04 --disk path=vm01.qcow2,device=disk --disk path=vm01-seed.qcow2,device=disk --import --network network=default,model=virtio,mac=$MAC_ADDR --noautoconsole --host-device=pci_0000_21_00_2

Then follow DEVELOPING.md to build Onload and register the Mellanox NIC VF for AF_XDP.
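
For reference, the registration step mirrors the one shown earlier in this thread (with enp5s0 assumed as the VF interface name inside this VM):

ulimit -l unlimited
echo enp5s0 > /sys/module/sfc_resource/afxdp/register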

Tested Onload with netperf. (Please ignore the performance, because this Mellanox NIC is the management interface of my server, which is connected to a 1G switch. :-( )

vm01# ./onload -p latency netperf -H 10.23.81.3 -t TCP_STREAM
oo:netperf[8819]: Using Onload <dev-snapshot> [7]
oo:netperf[8819]: Copyright 2019-present Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.23.81.3 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

131072  16384  16384    10.00     935.85

In another terminal, I can see u_polls increased.

vm01# ./onload_stackdump stats | grep polls
k_polls: 2
u_polls: 127759
ioctl_evq_polls: 0
periodic_polls: 2
interrupt_polls: 0
deferred_polls: 0
timeout_interrupt_polls: 0

Hope it helps.

Regards, Zhiyi

shirshen12 commented 2 years ago

@zhiyisun This is very helpful; it might mean something is off on the host side in Azure. As you can see from prior comments, I am trying on CentOS 8 with a 4.18+ kernel. Also, Onload-on-AF_XDP works with Intel NICs on CentOS 4.18+, but let me try Ubuntu 20.04 LTS (5.4+) and reproduce your steps.

shirshen12 commented 2 years ago

So @zhiyisun, I have been able to validate the instructions on vultr.com, which provides a single-port 25GbE Mellanox ConnectX-5 NIC, on Ubuntu 20.04 LTS.

But those instructions don't work on Ubuntu 20.04 LTS in an Azure Ds_v4 VM (with SR-IOV passthrough access).

It would be awesome if we could get on a call and debug why it isn't working. Are you OK with that?

zhiyisun commented 2 years ago

Hello @shirshen12, I am actually able to replicate the issue on Azure. It's the same as you described above. We need some time to debug this and will let you know when we get some clue.

maciejj-xilinx commented 2 years ago

@shirshen12, regarding HW filter insertion in the netvsc environment: in your comment #issuecomment-893672203 you indicated that you attempted to insert a HW filter. I understood it succeeded, as you did not report any error from that command. However, the xdp-redirect stats in the subsequent comment did not show the filter taking effect. So to clarify:

shirshen12 commented 2 years ago

@maciejj-xilinx yes. I was able to insert the filter, and it did show up, but the traffic got blackholed; when I don't insert the rule, the traffic bounces to the kernel transparently. Also, in this case the traffic always showed up at queue 0, but per the link above there is some odd issue with the queue numbering. The question is what that hardware queue number should be. On Intel NICs we see those hardware filter rules added transparently.

shirshen12 commented 2 years ago

Also @maciejj-xilinx and @zhiyisun, to test your work you will need to run a VM with:

  1. CentOS Linux 8 with 4.18+ kernel OR
  2. Ubuntu 20.04 LTS

and upgrade the driver to Mellanox OFED 5.4+, since AF_XDP ZC support is enabled from that version onwards.

shirshen12 commented 2 years ago

@maciejj-xilinx / @zhiyisun, can we check whether Onload can fall back to XDP generic mode if certain features are absent in DRV mode, such as XDP_REDIRECT support in the hv_netvsc driver?
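
For reference, manually attaching a program in generic (skb) mode mirrors the earlier bpftool attach, just with the generic attach type; eth0 here is assumed to be the synthetic netvsc interface, and this only illustrates generic-mode attachment rather than how Onload itself binds its AF_XDP socket:

bpftool net attach xdpgeneric name xdpsock dev eth0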

shirshen12 commented 2 years ago

@maciejj-xilinx / @zhiyisun, Microsoft Azure is currently implementing XDP_REDIRECT support in the Hyper-V PCI driver; I think it will start working after that, because of the dummy driver issue.

shirshen12 commented 2 years ago

The patch has been implemented: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=1cb9d3b6185b

I will test, report back, and close this if it works.

shirshen12 commented 2 years ago

The patch is now in the mainline Linux kernel:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/drivers/net/hyperv?h=v5.19-rc1

and in Ubuntu as well: https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/jammy/log/drivers/net/hyperv