Xilinx-CNS / onload

OpenOnload high performance user-level network stack

Does OpenOnload + AF_XDP + Intel NIC support nginx multiple workers? #70

Open ligong234 opened 2 years ago

ligong234 commented 2 years ago

Hello Onload Team,

I followed the instructions in https://www.xilinx.com/publications/onload/sf-onload-nginx-proxy-cookbook.pdf and managed to make OpenOnload work on CentOS Linux release 7.6.1810 (Core) with the CentOS 8.4 kernel 4.18.0-305.12.1. All onload drivers were loaded successfully and I can register the Intel XXV710 NIC with onload. I want to run an nginx proxy benchmark, starting with four worker processes, and from another machine I use wrk to generate the HTTP requests. I noticed that only one nginx process is handling requests while the others are all idle; if I kill this busy nginx, a new nginx process is forked and starts handling HTTP requests, but the other three nginx workers are always idle. So the question is: does OpenOnload + AF_XDP + Intel NIC support nginx with multiple workers?

my environment setup is as below:

[root@localhost openonload]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)

[root@localhost openonload]# uname -a
Linux localhost 4.18.0-305.12.1.el7.centos.x86_64 #1 SMP Wed Aug 25 14:27:38 CST 2021 x86_64 x86_64 x86_64 GNU/Linux

[root@localhost openonload]# ethtool -i eth1
driver: i40e
version: 4.18.0-305.12.1.el7.centos.x86_
firmware-version: 6.01 0x8000354e 1.1747.0
expansion-rom-version:
bus-info: 0000:5e:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

[root@localhost openonload]# lspci -s 5e:00.1 -v
5e:00.1 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
        Subsystem: Intel Corporation Ethernet Network Adapter XXV710
        Physical Slot: 3
        Flags: bus master, fast devsel, latency 0, IRQ 657, NUMA node 0
        Memory at c3000000 (64-bit, prefetchable) [size=16M]
        Memory at c5800000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at c5e00000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 78-d4-1c-ff-ff-b7-a6-40
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1a0] Transaction Processing Hints
        Capabilities: [1b0] Access Control Services
        Kernel driver in use: i40e
        Kernel modules: i40e

[root@localhost openonload]# rpm -qi nginx
Name        : nginx
Epoch       : 1
Version     : 1.16.0
Release     : 1.el7.ngx
Architecture: x86_64
Install Date: Tue 04 Jan 2022 04:04:30 PM CST
Group       : System Environment/Daemons
Size        : 2811760
License     : 2-clause BSD-like license
Signature   : RSA/SHA1, Tue 23 Apr 2019 11:13:55 PM CST, Key ID abf5bd827bd9bf62
Source RPM  : nginx-1.16.0-1.el7.ngx.src.rpm
Build Date  : Tue 23 Apr 2019 10:36:28 PM CST
Build Host  : centos74-amd64-builder-builder.gnt.nginx.com
Relocations : (not relocatable)
Vendor      : Nginx, Inc.
URL         : http://nginx.org/
Summary     : High performance web server
Description : nginx [engine x] is an HTTP and reverse proxy server, as well as a mail proxy server.

[root@localhost openonload]# cat /usr/libexec/onload/profiles/latency-af-xdp.opf
onload_set EF_POLL_USEC 100000
onload_set EF_AF_XDP_ZEROCOPY 0
onload_set EF_TCP_SYNRECV_MAX 8192
onload_set EF_MAX_ENDPOINTS 8192
onload_set EF_TCP_FASTSTART_INIT 0
onload_set EF_TCP_FASTSTART_IDLE 0

[root@localhost openonload]# cat /etc/nginx/nginx-proxy-node0-4-worker.conf
user root root;
worker_processes 4;
worker_rlimit_nofile 8388608;
worker_cpu_affinity 01 010 0100 01000 ;
pid /var/run/nginx-node0_4.pid;
events {
    multi_accept off;
    accept_mutex off;
    use epoll;
    worker_connections 200000;
}
error_log /var/log/error-node0_4.log debug;
http {
    default_type application/octet-stream;
    access_log off;
    error_log /dev/null crit;
    sendfile on;
    proxy_buffering off;
    keepalive_timeout 300s;
    keepalive_requests 1000000;
    server {
        listen 10.19.1.43:80 reuseport;
        listen 10.19.1.43:81 reuseport;
        server_name localhost;
        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
            root html;
        }
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
    upstream backend {
        server 10.96.10.21:80 ;
        keepalive 500;
    }
}

steps to reproduce the problem:

  1. load the onload driver

[root@localhost openonload]# onload_tool reload
onload_tool: /sbin/modprobe sfc
onload_tool: /sbin/modprobe onload

  2. register the NIC

[root@localhost openonload]# ethtool -K eth1 ntuple on
[root@localhost openonload]# ethtool -k eth1 | grep ntuple
[root@localhost openonload]# echo eth1 > /sys/module/sfc_resource/afxdp/register
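(For reference, a quick sanity check that the registration took effect - the dmesg line is the same one shown further down in this thread, and the sysfs path is the one used above:)

# expect a line like "[sfc efrm] efrm_nondl_register_device: register eth1"
dmesg | grep efrm_nondl_register_device
# the "register" entry written to above should be present here
ls /sys/module/sfc_resource/afxdp/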

  3. start nginx with onload

[root@localhost openonload]# /bin/onload -p latency-af-xdp /sbin/nginx -c /etc/nginx/nginx-proxy-node0-4-worker.conf
oo:nginx[32964]: Using Onload 20211221 [7]
oo:nginx[32964]: Copyright 2019-2021 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
oo:nginx[33000]: onload_setrlimit64: RLIMIT_NOFILE: hard limit requested 8388608, but set to 655360
oo:nginx[33000]: onload_setrlimit64: RLIMIT_NOFILE: soft limit requested 8388608, but set to 655360
oo:nginx[33002]: onload_setrlimit64: RLIMIT_NOFILE: hard limit requested 8388608, but set to 655360
oo:nginx[33002]: onload_setrlimit64: RLIMIT_NOFILE: soft limit requested 8388608, but set to 655360
oo:nginx[33004]: onload_setrlimit64: RLIMIT_NOFILE: hard limit requested 8388608, but set to 655360
oo:nginx[33004]: onload_setrlimit64: RLIMIT_NOFILE: soft limit requested 8388608, but set to 655360
oo:nginx[33007]: onload_setrlimit64: RLIMIT_NOFILE: hard limit requested 8388608, but set to 655360
oo:nginx[33007]: onload_setrlimit64: RLIMIT_NOFILE: soft limit requested 8388608, but set to 655360

[root@localhost openonload]# ps -ef | grep nginx
root     32999     1  0 11:28 ?        00:00:00 nginx: master process /sbin/nginx -c /etc/nginx/nginx-proxy-node0-4-worker.conf
root     33000 32999  1 11:28 ?        00:00:00 nginx: worker process
root     33002 32999  2 11:28 ?        00:00:00 nginx: worker process
root     33004 32999  2 11:28 ?        00:00:00 nginx: worker process
root     33007 32999  1 11:28 ?        00:00:00 nginx: worker process
root     33013 55380  0 11:28 pts/1    00:00:00 grep --color=auto nginx

[root@localhost openonload]# oo:nginx[33007]: Using Onload 20211221 [0]
oo:nginx[33007]: Copyright 2019-2021 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks

  4. start wrk on another machine

[root@localhost benchmark]# wrk -c 3200 -d 60 -t 32 --latency http://10.19.1.43/1kb.bin

  5. top shows only one nginx process is busy

[root@localhost openonload]# top
top - 11:19:29 up 1 day, 19:13,  3 users,  load average: 1.44, 1.31, 1.25
Tasks: 771 total,   2 running, 389 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.6 us,  0.8 sy,  0.0 ni, 98.1 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem : 26350249+total, 22556315+free, 27453884 used, 10485452 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 23326888+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
33007 root      20   0  335736 270784 151832 R  81.7  0.1   0:23.72 nginx   <================
27999 root       0 -20       0      0      0 I  11.6  0.0   0:03.19 kworker/u132:0-
14505 root      20   0   90520   5144   4276 S   0.3  0.0   0:17.66 rngd
28034 root      20   0  162716   5208   3804 R   0.3  0.0   0:00.12 top

Can someone point out what is wrong with my setup, or share some information regarding AF_XDP + XXV710? Any help is appreciated.

Best Regards, Ligong

maciejj-xilinx commented 2 years ago

Hello Ligong,

Thanks for the interest and the detailed info. The whitepaper is based on using a Solarflare NIC, which gives a fairly smooth experience. Also note that the nginx-proxy profile given in the whitepaper had quite a few specific options enabled.

With your set-up, do you use two NICs - one for upstream and one for downstream? And is eth1, the NIC you are accelerating with onload, upstream or downstream (or both)?

Accelerating upstream and downstream are somewhat separate problems, and it is best to tackle them one by one. It might take a few steps to get where you want to be.

Firstly, on the downstream side, RSS support is required to be able to receive traffic into multiple stacks. We have not tried using RSS on non-Solarflare NICs, and i40e ntuple filtering has a limitation - no RSS. The workaround is not to use an ntuple filter but to rely on the existing kernel MAC filter with RSS enabled.
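For illustration, a few ethtool queries show what the kernel driver's RSS is currently set up to do (a generic sketch, assuming the eth1 interface from earlier in this thread):

# how many RX/combined channels the NIC currently exposes
ethtool -l eth1
# the RSS indirection table and hash key the kernel MAC filter path will use
ethtool -x eth1
# per-queue counters, to confirm traffic is actually being spread across queues
ethtool -S eth1 | grep packets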

Before jumping to RSS, I'd advise tuning Nginx in single-worker mode to see whether you get appropriate performance.

Best Regards, Maciej

shirshen12 commented 2 years ago

Hi @maciejj-xilinx, this explains the problem we see with memcached in multi-threaded mode on Mellanox NICs as well. When memcached is offloaded via Onload-on-AF_XDP, we see only one thread processing traffic and all the rest idle. On deeper inspection, ethtool -S <ifname> | grep xdp_redirect shows the xdp_redirect counters increasing on only one queue, while the rest stay at zero.
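(A simple way to watch this per queue while traffic is running - a generic sketch, with the interface name left as a placeholder:)

# refresh the per-queue xdp_redirect counters once a second
watch -n 1 'ethtool -S <ifname> | grep xdp_redirect'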

Is there any patch you can apply quickly to use RSS on non-SFC NICs, or can you give instructions for using kernel MAC filters with RSS enabled?

ligong234 commented 2 years ago

> Hello Ligong,
>
> Thanks for the interest and the detailed info. The whitepaper is based on using a Solarflare NIC, which gives a fairly smooth experience. Also note that the nginx-proxy profile given in the whitepaper had quite a few specific options enabled.
>
> With your set-up, do you use two NICs - one for upstream and one for downstream? And is eth1, the NIC you are accelerating with onload, upstream or downstream (or both)?
>
> Accelerating upstream and downstream are somewhat separate problems, and it is best to tackle them one by one. It might take a few steps to get where you want to be.
>
> Firstly, on the downstream side, RSS support is required to be able to receive traffic into multiple stacks. We have not tried using RSS on non-Solarflare NICs, and i40e ntuple filtering has a limitation - no RSS. The workaround is not to use an ntuple filter but to rely on the existing kernel MAC filter with RSS enabled.
>
> Before jumping to RSS, I'd advise tuning Nginx in single-worker mode to see whether you get appropriate performance.
>
> Best Regards, Maciej

Hi Maciej,

Thanks for your quick reply. Basically we want to get an onload number quickly. My setup is pretty simple: two machines, both equipped with two Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz processors and 256G RAM. Machine one acts as the onload nginx proxy; machine two acts as the nginx origin and runs wrk. The nginx proxy machine has an Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02) dual-port adapter, and the nginx origin machine has a Mellanox Technologies Stand-up ConnectX-4 Lx EN 25GbE dual-port SFP28 adapter. Each pair of adapter ports is connected through a dedicated network switch. Both machines are installed with CentOS 7.6, Linux kernel 4.18.0-305.12.1, and the latest version of nginx.

 +-----------------------+                                     +-----------------------+
 |                       |-eth0  ------- switch 1 ------- eth0-| nginx origin (CPU 0~4)|
 |  onload nginx proxy   |                                     |                       |
 |       (CPU 0~4)       |-eth1 -------- switch 2 ------- eth1-| wrk (CPU 32~63)       |
 +-----------------------+                                     +-----------------------+
         machine one                                                  machine two

For this setup the onload nginx proxy uses two XXV710 NIC ports: eth1 is the nginx proxy downstream and eth0 is upstream. wrk runs on CPU 32~63 so it can generate more requests, both nginx instances run on CPU 0~4, and onload runs on machine one only.

One thing to mention: before this test I also ran a single nginx worker test, and the Onload number was pretty good, outperforming the Linux kernel by a lot.

For this setup I ran two tests: one where onload accelerates both eth0 and eth1, and one where it accelerates eth1 only. Both tests show that only one onload-accelerated nginx is busy (its CPU usage in top is high). Now I am focusing on how to make the nginx downstream spread traffic to multiple workers.

For the i40e NIC you mentioned the workaround of relying on the kernel MAC filter with RSS enabled - what configuration option or Linux command can I use to activate it?

I have taken a look at the onload AF_XDP support code (point out if I am wrong): on the kernel side, efhw/af_xdp.c NIC init loads the XDP program and attaches it to the NIC; af_xdp_init creates an AF_XDP socket on behalf of the onload-accelerated process, registers the UMEM and rings, and grabs their kernel mapping addresses. The user-space part, libonload.so, then mmaps the rings into userspace, so both the kernel module and the user-space process can operate on the AF_XDP rings. The AF_XDP socket binds to one NIC and one queue id, and relies on NIC hardware filters to redirect ingress traffic to that queue.

The Linux kernel Documentation/networking/af_xdp.rst mentions that the ring structures are single-consumer/single-producer. For the nginx single-worker case, I believe onload takes care of this and allows either user space or the kernel to touch the rings without interfering with each other. The problem arises when nginx forks multiple worker processes, because then there are multiple copies of the rings. I want to know how onload coordinates the concurrent ring access without breaking the Linux kernel's single-consumer/single-producer ring assumption. And from a hardware point of view, only one NIC queue is utilized - will this become a bottleneck? How can we utilize multiple NIC queues, or do we not need to care about it?
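(For what it's worth, this is roughly how I have been checking which queues and stacks actually carry traffic - the per-queue stat names are driver-specific, and the onload_stackdump commands are the ones used elsewhere in this thread:)

# per-queue RX counters on the i40e side (stat names vary by driver)
ethtool -S eth1 | grep -E 'rx-[0-9]+\.packets'
# per-stack view from Onload
onload_stackdump netstat
onload_stackdump lots | grep listen2synrecv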

Best Regards, Ligong

shirshen12 commented 2 years ago

Hi @maciejj-xilinx

Can you please respond to this question:

> From a hardware point of view, only one NIC queue is utilized - will this become a bottleneck? How can we utilize multiple NIC queues, or do we not need to care about it?

and

to this question:

> kernel MAC filter with RSS enabled - what configuration option or Linux command can I use to activate it?

shirshen12 commented 2 years ago

@ligong234 I think @maciejj-xilinx may be talking about this: https://github.com/Xilinx-CNS/onload/issues/28#issuecomment-1054526865

shirshen12 commented 2 years ago

Well, it looks like echo 0 > /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters does stop filters from being added, but ethtool -S ens5 | grep xdp still shows the redirect counter being triggered on only one queue.

ethtool -S ens5 | grep xdp

(screenshot of the ethtool -S ens5 | grep xdp output, 2022-03-29)
maciejj-xilinx commented 2 years ago

I have given this some thought but was not able to put together all the steps and test them. This might take a bit of time. Worth noting that giving a smoother experience would take some code changes. In the meantime we can see how feasible this would be.

@shirshen12 is this with a single Onload stack? To capture traffic on multiple queues, multiple stacks are needed.

This is a simple test on the RX side using RSS, with enable_af_xdp_flow_filters=0:

echo 0 > /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters
sudo env PATH="$PATH" EF_IRQ_CHANNEL=0 onload /home/maciejj/bin/ext-simple/interactive -c 'socket;setsockopt SOL_SOCKET SO_REUSEADDR 1; setsockopt SOL_SOCKET SO_REUSEPORT 1; bind 12345;listen;' &
sudo env PATH="$PATH" EF_IRQ_CHANNEL=1 onload /home/maciejj/bin/ext-simple/interactive -c 'socket;setsockopt SOL_SOCKET SO_REUSEADDR 1; setsockopt SOL_SOCKET SO_REUSEPORT 1; bind 12345;listen;' &
EF_IRQ_CHANNEL=2 ...

It is best to start as many instances as there are RX channels set up on your device - not fewer, not more - this is crucial to cover the entire traffic. The number of channels can be adjusted, e.g. ethtool -L <interface> combined 2 for just two channels (the capital -L form sets the count; lower-case -l queries it).

I would expect this to work for @ligong234 with the Intel NIC; with Mellanox ... probably - easy to check.

The test is to do several connection attempts from the peer host and see if both onload stacks show some traffic.

for((i=0;i<16;++i)); do wget host:12345 & done

In my case both stacks each opened about half of the connections:

onload_stackdump lots | grep listen2synrecv
listen2synrecv: 2
listen2synrecv: 2
ligong234 commented 2 years ago

@maciejj-xilinx, @shirshen12, thanks, guys, for sharing the valuable information. I will give it a try and post the result shortly. Ligong

ligong234 commented 2 years ago

Unfortunately, my setup does not have the parameter "enable_af_xdp_flow_filters", and the onload source code does not contain this string. My setup is derived from "2021-12-15 [Jamal Mulla] ON-13728: Fixes the missing CTPIO ptr issue (#856)". @maciejj-xilinx, which onload version are you running?

[root@localhost ~]# lsmod | grep onload
onload                827392  4 
sfc_char              118784  1 onload
sfc_resource          192512  2 onload,sfc_char
[root@localhost ~]#
[root@localhost ~]# ll /sys/module/sfc_resource/parameters/
total 0
-rw-r--r-- 1 root root 4096 Mar 30 10:40 enable_accel_by_default
-rw-r--r-- 1 root root 4096 Mar 30 10:40 enable_driverlink
-r--r--r-- 1 root root 4096 Mar 30 10:40 force_ev_timer
-r--r--r-- 1 root root 4096 Mar 30 10:40 pio
[root@localhost ~]#
[root@localhost ~]# find /sys/module/ -name enable_af_xdp_flow_filters
[root@localhost ~]# 

[root@localhost onload]# grep -rn enable_af_xdp_flow_filters .
[root@localhost onload]#
shirshen12 commented 2 years ago

Have you registered your NIC with AF_XDP, @ligong234?

ligong234 commented 2 years ago

@shirshen12 Yes, I have.

[root@localhost openonload]# onload_tool reload
onload_tool: /sbin/modprobe sfc
onload_tool: /sbin/modprobe onload
[root@localhost openonload]#

[root@localhost openonload]# cat do-register-nic 
dmesg -c

ethtool -K eth0 ntuple on
ethtool -K eth1 ntuple on
ethtool -k eth0 | grep ntuple
ethtool -k eth1 | grep ntuple

#echo eth0 > /sys/module/sfc_resource/afxdp/register
echo eth1 > /sys/module/sfc_resource/afxdp/register

dmesg -c

[root@localhost openonload]# . do-register-nic 
[ 3554.208521] Efx driverlink unregistering resource driver
[ 3554.224503] Solarflare driverlink driver unloading
[ 3564.259164] Solarflare driverlink driver v5.3.12.1008 API v33.0
[ 3564.263516] Solarflare NET driver v5.3.12.1008
[ 3564.286037] Efx driverlink registering resource driver
[ 3564.327892] [onload] Onload 20211221
[ 3564.327915] [onload] Copyright 2019-2021 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
[ 3564.432495] onload_cp_server[57195]: Spawned daemon process 57225
ntuple-filters: on
ntuple-filters: on
[ 3578.875681] [sfc efrm] efrm_nondl_register_device: register eth1
[ 3578.875889] [sfc efrm] eth1 type=4:
[ 3579.103249] irq 749: Affinity broken due to vector space exhaustion.
[ 3579.103262] irq 750: Affinity broken due to vector space exhaustion.
[ 3579.103276] irq 751: Affinity broken due to vector space exhaustion.
[ 3579.103289] irq 752: Affinity broken due to vector space exhaustion.
[ 3579.103302] irq 753: Affinity broken due to vector space exhaustion.
[ 3579.103314] irq 754: Affinity broken due to vector space exhaustion.
[ 3579.103326] irq 755: Affinity broken due to vector space exhaustion.
[ 3579.103339] irq 756: Affinity broken due to vector space exhaustion.
[ 3579.103352] irq 757: Affinity broken due to vector space exhaustion.
[ 3579.103364] irq 758: Affinity broken due to vector space exhaustion.
[ 3579.103377] irq 759: Affinity broken due to vector space exhaustion.
[ 3579.103389] irq 760: Affinity broken due to vector space exhaustion.
[ 3579.103402] irq 761: Affinity broken due to vector space exhaustion.
[ 3579.103415] irq 762: Affinity broken due to vector space exhaustion.
[ 3579.103428] irq 763: Affinity broken due to vector space exhaustion.
[ 3579.103441] irq 764: Affinity broken due to vector space exhaustion.
[ 3579.103633] irq 781: Affinity broken due to vector space exhaustion.
[ 3579.103644] irq 782: Affinity broken due to vector space exhaustion.
[ 3579.103657] irq 783: Affinity broken due to vector space exhaustion.
[ 3579.103670] irq 784: Affinity broken due to vector space exhaustion.
[ 3579.103683] irq 785: Affinity broken due to vector space exhaustion.
[ 3579.103696] irq 786: Affinity broken due to vector space exhaustion.
[ 3579.103708] irq 787: Affinity broken due to vector space exhaustion.
[ 3579.103721] irq 788: Affinity broken due to vector space exhaustion.
[ 3579.103733] irq 789: Affinity broken due to vector space exhaustion.
[ 3579.103748] irq 790: Affinity broken due to vector space exhaustion.
[ 3579.103761] irq 791: Affinity broken due to vector space exhaustion.
[ 3579.103775] irq 792: Affinity broken due to vector space exhaustion.
[ 3579.103786] irq 793: Affinity broken due to vector space exhaustion.
[ 3579.103800] irq 794: Affinity broken due to vector space exhaustion.
[ 3579.103814] irq 795: Affinity broken due to vector space exhaustion.
[ 3579.103826] irq 796: Affinity broken due to vector space exhaustion.
[ 3579.104432] [sfc efrm] eth1 index=0 ifindex=3
[ 3579.104438] [onload] oo_nic_add: ifindex=3 oo_index=0
[root@localhost openonload]# 
[root@localhost openonload]# ll /sys/module/sfc_resource/parameters/ 
total 0
-rw-r--r-- 1 root root 4096 Mar 30 11:29 enable_accel_by_default
-rw-r--r-- 1 root root 4096 Mar 30 11:29 enable_driverlink
-r--r--r-- 1 root root 4096 Mar 30 11:29 force_ev_timer
-r--r--r-- 1 root root 4096 Mar 30 11:29 pio
[root@localhost openonload]# 
shirshen12 commented 2 years ago

Is it an Intel NIC or a Mellanox NIC? If it's an Intel NIC, you need to enable Flow Director. I can give you exact instructions for the NIC make.

ligong234 commented 2 years ago

@shirshen12 it is Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)

shirshen12 commented 2 years ago

Please see the instructions for i40e below and follow them as-is. They are for Ubuntu 21.04 LTS.

Driver: ixgbe / i40e
OS: Ubuntu 21.04 LTS

upgrade to latest OS kernel

apt update -y
apt upgrade -y
apt full-upgrade -y

reboot into the new kernel:
reboot

Install dependencies

apt install build-essential net-tools unzip libcap-dev linux-tools-common linux-tools-generic netperf libevent-dev libnl-route-3-dev tk bison tcl libnl-3-dev flex libnl-route-3-200 dracut python2 libpcap-dev -y
apt install initramfs-tools -y

build the intel driver, ixgbe

wget https://downloadmirror.intel.com/682680/ixgbe-5.13.4.tar.gz
tar zxf ixgbe-5.13.4.tar.gz
cd ixgbe-5.13.4/src/
make install

build the intel driver, i40e

wget http://downloadmirror.intel.com/709707/i40e-2.17.15.tar.gz
tar zxf i40e-2.17.15.tar.gz
cd i40e-2.17.15/src/
make install

The binary will be installed as: /lib/modules/<KERNEL VER>/updates/drivers/net/ethernet/intel/ixgbe/ixgbe.ko

Load the ixgbe module using the modprobe command. rmmod ixgbe; modprobe ixgbe

Load the i40e module using the modprobe command. rmmod i40e; modprobe i40e

update the initrd/initramfs file to prevent the OS loading old versions of the ixgbe driver. update-initramfs -u

reboot again, just for safety:
reboot

Install Onload:

git clone https://github.com/Xilinx-CNS/onload.git
cd onload
scripts/onload_mkdist --release
cd onload-<version>/scripts/
./onload_install
./onload_tool reload

register the NIC with the AF_XDP driver interface: echo enp1s0 > /sys/module/sfc_resource/afxdp/register

turn on Intel Flow Director: ethtool --features enp1s0 ntuple on
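To double-check that everything took effect, something along these lines should work (paths and interface name as used above; just a sanity-check sketch):

# ntuple/Flow Director should report "on"
ethtool -k enp1s0 | grep ntuple
# the sfc_resource module parameters should now be visible
ls /sys/module/sfc_resource/parameters/
# and dmesg should show the interface being registered with the resource driver
dmesg | grep efrm_nondl_register_device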

You are set!

ligong234 commented 2 years ago

@shirshen12 Thanks for sharing the detailed instructions; that is exactly what I did. The small difference is that my system is CentOS and uses the in-tree kernel i40e driver, and the single-worker onload nginx works fine. The problem I have right now is that my onload resource driver does not have the "enable_af_xdp_flow_filters" parameter that maciejj-xilinx pointed out, which prevents me from running multiple instances of onload nginx.

> This is a simple test on the RX side using RSS, with enable_af_xdp_flow_filters=0:
>
> echo 0 > /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters
> sudo env PATH="$PATH" EF_IRQ_CHANNEL=0 onload /home/maciejj/bin/ext-simple/interactive -c 'socket;setsockopt SOL_SOCKET SO_REUSEADDR 1; setsockopt SOL_SOCKET SO_REUSEPORT 1; bind 12345;listen;' &
> sudo env PATH="$PATH" EF_IRQ_CHANNEL=1 onload /home/maciejj/bin/ext-simple/interactive -c 'socket;setsockopt SOL_SOCKET SO_REUSEADDR 1; setsockopt SOL_SOCKET SO_REUSEPORT 1; bind 12345;listen;' &

my env and instructions are as below:

[root@localhost openonload]# ethtool -i eth1
driver: i40e
version: 4.18.0-305.12.1.el7.centos.x86_
firmware-version: 6.01 0x8000354e 1.1747.0
expansion-rom-version: 
bus-info: 0000:5e:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

[root@localhost openonload]# lspci -s 0000:5e:00.1 -v
5e:00.1 Ethernet controller: Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
        Subsystem: Intel Corporation Ethernet Network Adapter XXV710
        Physical Slot: 3
        Flags: bus master, fast devsel, latency 0, IRQ 582, NUMA node 0
        Memory at c3000000 (64-bit, prefetchable) [size=16M]
        Memory at c5800000 (64-bit, prefetchable) [size=32K]
        Expansion ROM at c5e00000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number 78-d4-1c-ff-ff-b7-a6-40
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [1a0] Transaction Processing Hints
        Capabilities: [1b0] Access Control Services
        Kernel driver in use: i40e
        Kernel modules: i40e

[root@localhost openonload]# cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core) 

[root@localhost openonload]# uname -a
Linux localhost 4.18.0-305.12.1.el7.centos.x86_64 #1 SMP Wed Aug 25 14:27:38 CST 2021 x86_64 x86_64 x86_64 GNU/Linux

[root@localhost openonload]# modinfo i40e
filename:       /lib/modules/4.18.0-305.12.1.el7.centos.x86_64/kernel/drivers/net/ethernet/intel/i40e/i40e.ko.xz
version:        4.18.0-305.12.1.el7.centos.x86_64
license:        GPL v2
description:    Intel(R) Ethernet Connection XL710 Network Driver
author:         Intel Corporation, <e1000-devel@lists.sourceforge.net>
rhelversion:    8.4
srcversion:     78E81CDBAAC80E980F550F5
alias:          pci:v00008086d0000158Bsv*sd*bc*sc*i*
...
alias:          pci:v00008086d00001572sv*sd*bc*sc*i*
depends:        
intree:         Y
name:           i40e
vermagic:       4.18.0-305.12.1.el7.centos.x86_64 SMP mod_unload modversions 
parm:           debug:Debug level (0=none,...,16=all), Debug mask (0x8XXXXXXX) (uint)
[root@localhost openonload]# 

[root@localhost openonload]# ethtool -K eth1 ntuple on
[root@localhost openonload]# onload_tool reload
onload_tool: /sbin/modprobe sfc
onload_tool: /sbin/modprobe onload
[root@localhost openonload]# echo eth1 > /sys/module/sfc_resource/afxdp/register
[root@localhost openonload]# ls -l  /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters
ls: cannot access /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters: No such file or directory
[root@localhost openonload]#
shirshen12 commented 2 years ago

Can you move to Red Hat 8 or CentOS 8? I know you are on 4.18+, but it looks like somehow the eBPF VM is not baked into CentOS 7.6 (it was in preview mode last I knew, not production grade).
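(One generic way to see what BPF/AF_XDP support the running kernel was built with - not Onload-specific, just a kernel config check:)

# look for CONFIG_BPF, the bpf() syscall and AF_XDP socket support
grep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=|CONFIG_XDP_SOCKETS=' /boot/config-$(uname -r)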

abower-amd commented 2 years ago

Hi @ligong234,

> The problem I have right now is that my onload resource driver does not have the "enable_af_xdp_flow_filters" parameter

You need to update your source tree. This feature got added with b8ba4e24a48bf145b4ad782c7c19491399825d4e on 28 Feb 2022.
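A rough sequence for picking that up (assuming an existing git checkout and the same build steps given earlier in this thread):

cd onload
git pull                      # bring in b8ba4e24 and later
scripts/onload_mkdist --release
cd onload-<version>/scripts/
./onload_install
./onload_tool reload
# the new parameter should now exist
ls /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters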

Andy

ligong234 commented 2 years ago

@abower-xilinx Thank you, I will give it a try.

shirshen12 commented 2 years ago

Did it work for you, @ligong234? I thought you were using the latest master branch of onload.

ligong234 commented 2 years ago

@shirshen12 I have not tried the latest master branch; my test is based on the Onload 2021-12-15 commit. I will try the latest master and report the result.

shirshen12 commented 2 years ago

> I have given this some thought but was not able to put together all the steps and test them. This might take a bit of time. Worth noting that giving a smoother experience would take some code changes. In the meantime we can see how feasible this would be.
>
> @shirshen12 is this with a single Onload stack? To capture traffic on multiple queues, multiple stacks are needed.
>
> This is a simple test on the RX side using RSS, with enable_af_xdp_flow_filters=0:
>
> echo 0 > /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters
> sudo env PATH="$PATH" EF_IRQ_CHANNEL=0 onload /home/maciejj/bin/ext-simple/interactive -c 'socket;setsockopt SOL_SOCKET SO_REUSEADDR 1; setsockopt SOL_SOCKET SO_REUSEPORT 1; bind 12345;listen;' &
> sudo env PATH="$PATH" EF_IRQ_CHANNEL=1 onload /home/maciejj/bin/ext-simple/interactive -c 'socket;setsockopt SOL_SOCKET SO_REUSEADDR 1; setsockopt SOL_SOCKET SO_REUSEPORT 1; bind 12345;listen;' &
> EF_IRQ_CHANNEL=2 ...
>
> It is best to start as many instances as there are RX channels set up on your device - not fewer, not more - this is crucial to cover the entire traffic. The number of channels can be adjusted, e.g. ethtool -L <interface> combined 2 for just two channels.
>
> I would expect this to work for @ligong234 with the Intel NIC; with Mellanox ... probably - easy to check.
>
> The test is to do several connection attempts from the peer host and see if both onload stacks show some traffic.
>
> for((i=0;i<16;++i)); do wget host:12345 & done
>
> In my case both stacks each opened about half of the connections:
>
> onload_stackdump lots | grep listen2synrecv
> listen2synrecv: 2
> listen2synrecv: 2

Hi @maciejj-xilinx, it's for multithreaded memcached, memcached -t 4.

ligong234 commented 2 years ago

Update: today I tried the latest Onload master branch and hit another error. Multiple nginx instances start successfully, but when I start wrk, nginx complains that it failed to allocate a stack, dmesg shows it is out of VI instances, and the error code is -EBUSY.

oo:nginx[21345]: netif_tcp_helper_alloc_u: ERROR: Failed to allocate stack (rc=-16)
[ 1582.957667] [sfc efrm] efrm_vi_alloc: Out of VI instances with given attributes (-16)

# cat /usr/include/asm-generic/errno-base.h
...
#define EBUSY           16      /* Device or resource busy */

Steps to reproduce: I turn off XDP zerocopy, because when it is on, onload fails to allocate UMEM:

i40e 0000:5e:00.1: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 1
i40e 0000:5e:00.1: Failed to allocate some buffers on UMEM enabled Rx ring 1 (pf_q 66)

[root@localhost benchmark]# cat /usr/libexec/onload/profiles/latency-af-xdp.opf 
# SPDX-License-Identifier: BSD-2-Clause
# X-SPDX-Copyright-Text: (c) Copyright 2010-2019 Xilinx, Inc.

# OpenOnload low latency profile.

# Enable polling / spinning.  When the application makes a blocking call
# such as recv() or poll(), this causes Onload to busy wait for up to 100ms
# before blocking.
#
onload_set EF_POLL_USEC 100000

# enable AF_XDP for Onload
#onload_set EF_AF_XDP_ZEROCOPY 1
onload_set EF_AF_XDP_ZEROCOPY 0
onload_set EF_TCP_SYNRECV_MAX 8192
onload_set EF_MAX_ENDPOINTS 8192

# Disable FASTSTART when connection is new or has been idle for a while.
# The additional acks it causes add latency on the receive path.
onload_set EF_TCP_FASTSTART_INIT 0
onload_set EF_TCP_FASTSTART_IDLE 0

[root@localhost benchmark]# cat start-nginx-proxy-onload-workaround.sh 
#!/bin/bash
sysctl -w net.ipv4.ip_local_port_range='9000 65000';
sysctl -w vm.nr_hugepages=10000;
sysctl -w fs.file-max=8388608;
sysctl -w fs.nr_open=8388608;
ulimit -n 8388608;

echo 0 > /sys/module/sfc_resource/parameters/enable_af_xdp_flow_filters

# Start Nginx proxy
function start_nginx() {
        local cpu=$1
        export EF_IRQ_CHANNEL=$cpu
        taskset -c $cpu /bin/onload -p latency-af-xdp \
                /sbin/nginx -c /etc/nginx/nginx-proxy-one-worker.conf \
                        -g "pid /var/run/nginx-worker-${cpu}.pid; error_log /var/log/nginx-worker-${cpu}-error.log; " \
        &
}

nginx_works=4
for ((i=0; i<$nginx_works; i++)) ; do
        start_nginx $i
done

ps -ef | grep nginx

[root@localhost benchmark]# . start-nginx-proxy-onload-workaround.sh
net.ipv4.ip_local_port_range = 9000 65000
vm.nr_hugepages = 10000
fs.file-max = 8388608
fs.nr_open = 8388608
root     21212 17533  0 09:03 pts/0    00:00:00 /sbin/nginx -c /etc/nginx/nginx-proxy-one-worker.conf -g pid /var/run/nginx-worker-0.pid; error_log /var/log/nginx-worker-0-error.log;
root     21213 17533  0 09:03 pts/0    00:00:00 /sbin/nginx -c /etc/nginx/nginx-proxy-one-worker.conf -g pid /var/run/nginx-worker-1.pid; error_log /var/log/nginx-worker-1-error.log;
root     21214 17533  0 09:03 pts/0    00:00:00 /sbin/nginx -c /etc/nginx/nginx-proxy-one-worker.conf -g pid /var/run/nginx-worker-2.pid; error_log /var/log/nginx-worker-2-error.log;
root     21215 17533  0 09:03 pts/0    00:00:00 /sbin/nginx -c /etc/nginx/nginx-proxy-one-worker.conf -g pid /var/run/nginx-worker-3.pid; error_log /var/log/nginx-worker-3-error.log;
root     21217 17533  0 09:03 pts/0    00:00:00 grep --color=auto nginx
[root@localhost benchmark]# oo:nginx[21215]: Using Onload 20220330 [0]
oo:nginx[21215]: Copyright 2019-2022 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
oo:nginx[21212]: Using Onload 20220330 [1]
oo:nginx[21212]: Copyright 2019-2022 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
oo:nginx[21214]: Using Onload 20220330 [2]
oo:nginx[21214]: Copyright 2019-2022 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
oo:nginx[21213]: Using Onload 20220330 [3]
oo:nginx[21213]: Copyright 2019-2022 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks

[root@localhost benchmark]# onload_stackdump netstat
TCP 0 0 10.19.231.43:80 0.0.0.0:0 LISTEN
TCP 0 0 10.19.231.43:80 0.0.0.0:0 LISTEN
TCP 0 0 10.19.231.43:80 0.0.0.0:0 LISTEN
TCP 0 0 10.19.231.43:80 0.0.0.0:0 LISTEN
[root@localhost benchmark]#
[root@localhost benchmark]# oo:nginx[21345]: netif_tcp_helper_alloc_u: ERROR: Failed to allocate stack (rc=-16)
See kernel messages in dmesg or /var/log/syslog for more details of this failure
oo:nginx[21345]: netif_tcp_helper_alloc_u: ERROR: Failed to allocate stack (rc=-16)
See kernel messages in dmesg or /var/log/syslog for more details of this failure
oo:nginx[21345]: netif_tcp_helper_alloc_u: ERROR: Failed to allocate stack (rc=-16)
See kernel messages in dmesg or /var/log/syslog for more details of this failure

[root@localhost benchmark]# dmesg -c
...
[ 1582.954945] [sfc efrm] efrm_vi_alloc: Out of VI instances with given attributes (-16)
[ 1582.955855] [sfc efrm] efrm_vi_alloc: Out of VI instances with given attributes (-16)
[ 1582.956723] [sfc efrm] efrm_vi_alloc: Out of VI instances with given attributes (-16)
[ 1582.957667] [sfc efrm] efrm_vi_alloc: Out of VI instances with given attributes (-16)
[ 1582.958605] [sfc efrm] efrm_vi_alloc: Out of VI instances with given attributes (-16)
shirshen12 commented 2 years ago

By default, onload does not fall back to the generic XDP mode when native driver support is not available. Please don't turn off ZC support for AF_XDP. I also used to get this error, @ligong234.
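(i.e. in the latency-af-xdp.opf profile shown above, keep the zero-copy line enabled rather than the =0 variant:)

# keep AF_XDP zero-copy enabled in the profile
onload_set EF_AF_XDP_ZEROCOPY 1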

shirshen12 commented 2 years ago

I tested the SO_REUSEPORT approach and it works! But yeah, multithreaded apps with auto-sensing of RSS are not there yet, so we can work around it this way.