Xilinx-CNS / onload

OpenOnload high performance user-level network stack

ixgbe: __oof_socket_add_wild: 1:2047 ERROR: FILTER TCP 10.194.139.66:1 0.0.0.0:0 failed #19

Open shijiesun opened 3 years ago

shijiesun commented 3 years ago

I cannot start Onload on CentOS 8.3 or Debian 10 (kernel 5.8). Why does Onload need to find mtdchar? The errors follow:

[root@localhost onload]# "$(mmaketool --toppath)/build/$(mmaketool --driverbuild)/driver/linux/load.sh" onload
unload.sh: /sbin/rmmod onload
unload.sh: /sbin/rmmod sfc_char
unload.sh: /sbin/rmmod sfc_resource
unload.sh: /sbin/rmmod sfc
unload.sh: /sbin/rmmod virtual_bus
unload.sh: /sbin/rmmod sfc_driverlink
NET_OPT is
CHAR_OPT is
modprobe: FATAL: Module mtdchar not found in directory /lib/modules/4.18.0-240.10.1.el8_3.x86_64
ERROR: Did not find sfc_control in /proc/devices
sfc is a DEBUG driver
RESOURCE_OPT is
CHAR_OPT is
ONLOAD_OPT is

ol-alexandra commented 3 years ago

It doesn't really need mtdchar on most kernels. What do you mean when you say "I can not start Onload"?

I agree that it's better to fix load.sh to avoid printing unimportant errors, but it is a developer's tool. Do you see any real issue with Onload? Which application do you use? Does it work with Onload?

shijiesun commented 3 years ago

Thanks for your reply! Do you mean that the errors printed while loading the drivers do not matter? The situation is this: when loading the drivers with "load.sh onload", I see the errors that I mentioned last time:

modprobe: FATAL: Module mtdchar not found in directory /lib/modules/4.18.0-240.10.1.el8_3.x86_64
ERROR: Did not find sfc_control in /proc/devices

Then I try to use the Onload library, either by calling "scripts/onload" or via "LD_PRELOAD", to speed up my app, and it fails. The logs follow:

ssj@ssj-debian10:~/github/sutn$ LD_PRELOAD="$(mmaketool --toppath)/build/$(mmaketool --userbuild)/lib/transport/unix/libcitransport0.so" sockperf sr -i 192.168.56.102 -p 1233 --tcp
citp_oo_get_cpu_khz: Failed to open /dev/onload
oo:sockperf[18408]: __citp_netif_alloc: failed to open driver (1)
oo:sockperf[18408]: citp_netif_alloc_and_init: failed to create netif (1)
oo:sockperf[18408]: citp_tcp_socket: failed (errno:1) - PASSING TO OS
oo_onloadfs_dev_t: Failed to open /dev/onload
sockperf: == version #3.7-1.gitb741ab3c60b1 ==
sockperf: [SERVER] listen on:
[ 0] IP = 192.168.56.102  PORT =  1233 # TCP

OS version:

root@ssj-debian10:/home/ssj/github/onload# uname -a
Linux ssj-debian10 5.8.0-0.bpo.2-amd64 #1 SMP Debian 5.8.10-1~bpo10+1 (2020-09-26) x86_64 GNU/Linux

or

[root@bogon dev]# uname -a
Linux bogon 4.18.0-240.10.1.el8_3.x86_64 #1 SMP Mon Jan 18 17:05:51 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@bogon dev]# cat /etc/centos-release
CentOS Linux release 8.3.2011
[root@bogon dev]#

I found Onload the day before yesterday. I want to test its effect with "sockperf", and then use it in distributed systems to lower network cost, reduce latency and increase throughput. I need your help. Thank you very much!

ol-alexandra commented 3 years ago

Are you using Solarflare NICs? Or AF_XDP? Have you read https://github.com/Xilinx-CNS/onload#installation-and-quick-start-guide:

echo ens2f0 > /sys/module/sfc_resource/afxdp/register

h2cw2l commented 3 years ago

Thanks for your reply. We do not use Solarflare NICs; we just want to test AF_XDP.

1. Yes, I have executed this command: echo ens2f0 > /sys/module/sfc_resource/afxdp/register

2. The information about my environment is as follows. Can Onload run on this device? If yes, what steps should I follow?

(1) NIC:

[root@A03-R05-I139-66-FVP3HP2 ~]# ethtool -i eth0
driver: ixgbe
version: 5.1.0-k-rh8.2.0
firmware-version: 0x8000090c, 18.3.6
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

(2) OS:

[root@A03-R05-I139-66-FVP3HP2 ~]# uname -r
4.18.0-240.10.1.el8_3.x86_64
[root@A03-R05-I139-66-FVP3HP2 ~]# cat /etc/centos-release
CentOS Linux release 8.2.2004 (Core)

Any reply will be greatly appreciated. Thank you very much.

ol-alexandra commented 3 years ago

  1. Did the onload module load? Let me repeat: the complaints from load.sh are non-fatal. Please share the output of lsmod | grep sfc and lsmod | grep onload.
  2. What do you see in dmesg?
  3. (If the onload module has been loaded) Is /dev/onload there? load.sh usually creates it.
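
A minimal sketch of collecting these three items in one go. The helper name missing_modules is hypothetical, and the list of expected modules is an assumption taken from the load.sh output earlier in this thread; it is not an official Onload check.

```shell
#!/bin/sh
# Sketch only: helps gather the information requested above.

# missing_modules (hypothetical helper): read `lsmod` output on stdin and
# print each expected module that does not appear in it.
# Assumption: sfc_resource, sfc_char and onload are the modules that matter.
missing_modules() {
    awk -v want="sfc_resource sfc_char onload" '
        { seen[$1] = 1 }                      # first column is the module name
        END {
            n = split(want, w, " ")
            for (i = 1; i <= n; i++)
                if (!(w[i] in seen)) print w[i]
        }'
}

# Typical use on the affected host (dmesg may need root):
#   lsmod | missing_modules
#   dmesg | grep -E '\[(onload|sfc [a-z]+)\]' | tail -n 40
#   ls -l /dev/onload /dev/onload_epoll
```

If missing_modules prints nothing, all three modules are present and the dmesg output is the next place to look.
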

h2cw2l commented 3 years ago

Thank you for the reply. Yes, we no longer worry about the load.sh messages.

  1. The output of lsmod | grep sfc and lsmod | grep onload is as follows. Is there anything wrong?

[root@A03-R05-I139-66-FVP3HP2 onload]# lsmod | grep sfc
sfc_char 106496 1 onload
sfc_resource 180224 2 onload,sfc_char
sfc 524288 0
virtual_bus 16384 1 sfc
sfc_driverlink 16384 2 sfc,sfc_resource
vdpa 16384 1 sfc
mtd 69632 1 sfc
mdio 16384 2 sfc,ixgbe
[root@A03-R05-I139-66-FVP3HP2 onload]# lsmod | grep onload
onload 794624 4
sfc_char 106496 1 onload
sfc_resource 180224 2 onload,sfc_char
[root@A03-R05-I139-66-FVP3HP2 onload]#

  2. What do you see in dmesg?

NDEBUG:

[ 1365.686192] [onload] [1]: WARNING: huge pages are incompatible with AF_XDP. Disabling hugepage support.
[ 1365.716153] [onload] __oof_socket_add_wild: 1:2047 ERROR: FILTER TCP 10.194.139.66:1 0.0.0.0:0 failed (-95)
[ 1365.718787] [sfc efhw] af_xdp_flush_rx_dma_channel: FIXME AF_XDP
[ 1365.719212] [sfc efhw] af_xdp_flush_tx_dma_channel: FIXME AF_XDP

DEBUG:

[ 9653.298932] [onload] [6]: WARNING: huge pages are incompatible with AF_XDP. Disabling hugepage support.
[ 9653.347207] [onload] __oof_socket_add_wild: 6:2047 ERROR: FILTER TCP 10.194.139.67:1500 0.0.0.0:0 failed (-95)
[ 9653.360091] [sfc efrm] efrm_pt_flush: [rs:0,00000000dbd0b04d] EVQ=2048 TXQ=512 RXQ=512
[ 9653.360094] [sfc efrm] efrm_vi_resource_issue_flush: rx queue 0 flush requested for nic 0
[ 9653.360096] [sfc efhw] af_xdp_flush_rx_dma_channel: FIXME AF_XDP
[ 9653.365175] [sfc efrm] Flushed queue nic 0 type 1 0x0 rc -95
[ 9653.365197] [sfc efrm] efrm_vi_resource_issue_flush: tx queue 0 flush requested for nic 0
[ 9653.365198] [sfc efhw] af_xdp_flush_tx_dma_channel: FIXME AF_XDP
[ 9653.370204] [sfc efrm] Flushed queue nic 0 type 0 0x0 rc -95
[ 9653.370252] [sfc efrm] efrm_vi_rm_delayed_free: 00000000f515d7eb
[ 9653.370253] [sfc efrm] efrm_vi_rm_delayed_free: flushed VI instance=0
[ 9653.370295] [sfc efrm] efrm_vi_rm_free_flushed_resource: [rs:0,00000000dbd0b04d]
[ 9653.370296] [sfc efrm] __efrm_vi_resource_free: Freeing 0
[ 9653.370320] [sfc efrm] Flushed queue nic 0 type 2 0x0 rc 0

  3. (If the onload module has been loaded) Is /dev/onload there? Yes, it is there.

[root@A03-R05-I139-66-FVP3HP2 onload]# ls /dev/ | grep onload
onload
onload_epoll
[root@A03-R05-I139-66-FVP3HP2 onload]#

Any reply will be greatly appreciated.

h2cw2l commented 3 years ago

@ol-alexandra Hi, can you reproduce this issue in your local environment? Thank you.

[ 9653.298932] [onload] [6]: WARNING: huge pages are incompatible with AF_XDP. Disabling hugepage support.
[ 9653.347207] [onload] __oof_socket_add_wild: 6:2047 ERROR: FILTER TCP 10.194.139.67:1500 0.0.0.0:0 failed (-95)
[ 9653.360091] [sfc efrm] efrm_pt_flush: [rs:0,00000000dbd0b04d] EVQ=2048 TXQ=512 RXQ=512
[ 9653.360094] [sfc efrm] efrm_vi_resource_issue_flush: rx queue 0 flush requested for nic 0
[ 9653.360096] [sfc efhw] af_xdp_flush_rx_dma_channel: FIXME AF_XDP
[ 9653.365175] [sfc efrm] Flushed queue nic 0 type 1 0x0 rc -95
[ 9653.365197] [sfc efrm] efrm_vi_resource_issue_flush: tx queue 0 flush requested for nic 0
[ 9653.365198] [sfc efhw] af_xdp_flush_tx_dma_channel: FIXME AF_XDP
[ 9653.370204] [sfc efrm] Flushed queue nic 0 type 0 0x0 rc -95
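
The number in parentheses after "failed" looks like a negated Linux errno value, as is conventional for kernel return codes; that reading is an assumption, not something the log itself states. Under that assumption, a small sketch that pulls the filter failures out of a kernel log and names the code (the helper name filter_errno is hypothetical, and only the codes 22 and 95 are mapped):

```shell
#!/bin/sh
# filter_errno (hypothetical helper): read kernel log lines on stdin and,
# for each Onload "ERROR: FILTER ... failed (-N)" line, print the protocol,
# the local endpoint and a symbolic name for N.
# Assumption: N is a standard Linux errno (22 = EINVAL, 95 = EOPNOTSUPP).
filter_errno() {
    sed -n 's/.*ERROR: FILTER \([A-Z]*\) \([0-9.:]*\) .*failed (-\([0-9]*\)).*/\1 \2 \3/p' |
    while read -r proto endpoint code; do
        case "$code" in
            22) name="EINVAL (invalid argument)" ;;
            95) name="EOPNOTSUPP (operation not supported)" ;;
            *)  name="errno $code" ;;
        esac
        printf '%s %s: %s\n' "$proto" "$endpoint" "$name"
    done
}

# Typical use (dmesg may need root):
#   dmesg | filter_errno
```

Read this way, -95 would mean the driver reported the filter operation as unsupported rather than rejecting one particular rule.
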

ol-alexandra commented 3 years ago

No, I can not reproduce it because I do not have ixgbe NICs.

h2cw2l commented 3 years ago

OK, understood.

1. Could you give me some advice on how to solve this problem?

2. Do you know of anyone who has run Onload on ixgbe successfully?

Thanks.

abower-amd commented 3 years ago

Hi, Onload with non-Solarflare NICs is a community-supported capability. Since you are experiencing an error related to filtering I think you need to follow up this suggestion further: https://github.com/Xilinx-CNS/onload/issues/10#issuecomment-785929182

h2cw2l commented 3 years ago

Yes, I have also noticed that issue, but I did not understand what @maciejj-xilinx meant by "ethtool --features enp4s0f0 ntuple".

kieranm-xilinx commented 3 years ago

When running on non-Solarflare NICs, Onload relies on the NIC's driver supporting ntuple filters. The Intel driver supports enabling this through the ethtool command that is commonly used to configure network interface properties. By running the ethtool command given, using the correct network interface name for your system in place of enp4s0f0, you can turn on Intel's ntuple filtering support in their driver.
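
A minimal sketch of that check-then-enable step. The helper names ntuple_state and ensure_ntuple are hypothetical; the ethtool options used (--show-features and -K ... ntuple on) are the ones that appear in this thread. Whether the driver actually honours the setting depends on the NIC.

```shell
#!/bin/sh
# ntuple_state (hypothetical helper): read `ethtool --show-features IFACE`
# output on stdin and print the state of the ntuple-filters feature
# ("on" or "off"); prints nothing if the feature is not listed.
ntuple_state() {
    awk '$1 == "ntuple-filters:" { print $2 }'
}

# ensure_ntuple IFACE (hypothetical helper): turn ntuple filtering on
# if it is not reported as on already. Needs root.
ensure_ntuple() {
    iface=$1
    state=$(ethtool --show-features "$iface" | ntuple_state)
    if [ "$state" != "on" ]; then
        ethtool -K "$iface" ntuple on
    fi
}

# Typical use, with your own interface name in place of eth0:
#   ensure_ntuple eth0
```
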

h2cw2l commented 3 years ago

It is OK now, thank you very much.

The command I used: ethtool -K eth0 ntuple on

sundbp commented 3 years ago

Hi,

I'm seeing the same or a similar error here, using a card with the igb driver on kernel 5.11. The driver got AF_XDP support in 5.10.

My interface is added to /sys/module/sfc_resource/afxdp/register and I've turned on the ntuple flag for the interface.

If I try to run e.g. sudo ./onload nc -l 9898 I see:

↳ sudo ./onload nc -l 9898                 
oo:nc[2289307]: Using Onload 20210611 [2]
oo:nc[2289307]: Copyright 2019-2021 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
nc: listen: Invalid argument

And syslog shows:

Jun 11 11:10:58 monster kernel: [onload] [2]: WARNING: huge pages are incompatible with AF_XDP. Disabling hugepage support.
Jun 11 11:10:58 monster kernel: [onload] __oof_socket_add_wild: 2:2047 ERROR: FILTER TCP 10.10.10.11:9898 0.0.0.0:0 failed (-22)
Jun 11 11:10:58 monster kernel: [sfc efhw] af_xdp_flush_rx_dma_channel: FIXME AF_XDP
Jun 11 11:10:58 monster kernel: [sfc efhw] af_xdp_flush_tx_dma_channel: FIXME AF_XDP

If I run iperf instead, it reports a similar error (it does not exit, but it is not possible to connect to it):

↳ sudo ./onload iperf -s  
oo:iperf[2290537]: Using Onload 20210611 [3]
oo:iperf[2290537]: Copyright 2019-2021 Xilinx, 2006-2019 Solarflare Communications, 2002-2005 Level 5 Networks
listen failed: Invalid argument
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------

The syslog is similar:

Jun 11 11:12:30 monster kernel: [onload] [3]: WARNING: huge pages are incompatible with AF_XDP. Disabling hugepage support.
Jun 11 11:12:30 monster kernel: [onload] __oof_socket_add_wild: 3:2047 ERROR: FILTER TCP 10.10.10.11:5001 0.0.0.0:0 failed (-22)

My kernel version:

Linux monster 5.11.22-2-MANJARO #1 SMP PREEMPT Fri May 21 17:45:54 UTC 2021 x86_64 GNU/Linux

I am running from the git repo as of SHA 4267b166ea37d4d780160003a65029422fbd476a.

Happy to assist in any further information gathering to help figure out what's going on!

maciejj-xilinx commented 3 years ago

Hello sundbp,

Thanks for the detailed report.

We have not yet tested AF_XDP with 5.10 or 5.11, but we do not know of any reason why AF_XDP with ixgbe would fail there.

There is one thing worth checking.

With ixgbe devices there are some restrictions: for example, only one filter type can be installed on the NIC, and the presence of one type of filter prevents other types from being inserted.

I was wondering what the outcome would be of manually inserting a filter of the type that Onload uses. This should be the command line to do it:

sudo ethtool -U ethX flow-type tcp4 dst-ip 10.10.10.11 dst-port 9898 action 1

Also, are there any filters already installed on the NIC? They can be listed with:

sudo ethtool -u ethX

What is the outcome of running these commands?
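
The two checks above can be sketched as a pair of shell helpers. The names rule_count and probe_filter are hypothetical. The "Total N rules" line is the ethtool -u output format as it appears later in this thread; the "Added rule with ID N" message and the delete sub-command are assumptions about ethtool's behaviour that should be verified against your ethtool version.

```shell
#!/bin/sh
# rule_count (hypothetical helper): read `ethtool -u IFACE` output on stdin
# and print the number of installed RX classification rules.
rule_count() {
    awk '$1 == "Total" && $3 == "rules" { print $2 }'
}

# probe_filter IFACE IP PORT (hypothetical helper): try to insert a
# destination IP/port TCP rule like the one suggested above, report the
# result, and remove the rule again if ethtool printed its ID. Needs root.
probe_filter() {
    iface=$1 ip=$2 port=$3
    if ! out=$(ethtool -U "$iface" flow-type tcp4 dst-ip "$ip" dst-port "$port" action 1 2>&1); then
        echo "insert failed: $out"
        return 1
    fi
    echo "insert ok: $out"
    # Assumption: on success ethtool prints "Added rule with ID N".
    id=$(printf '%s\n' "$out" | sed -n 's/^Added rule with ID \([0-9][0-9]*\).*/\1/p')
    if [ -n "$id" ]; then
        ethtool -U "$iface" delete "$id"
    fi
}

# Typical use, with your own interface name in place of ethX:
#   ethtool -u ethX | rule_count
#   probe_filter ethX 10.10.10.11 9898
```

If the insert fails here as well, the problem lies between ethtool and the NIC driver rather than in Onload itself.
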

ol-alexandra commented 3 years ago

> We have not yet tested AF_XDP with 5.10 or 5.11,

We did. It works. But we tested with SFC NICs only (which is completely useless from any normal user's point of view).

maciejj-xilinx commented 3 years ago

> I'm seeing the same/similar error here using an igb driver card on kernel 5.11.

I just noticed that you actually mentioned the igb driver. With Intel NICs we have tested ixgbe but not igb.

Support for ntuple filters on igb devices might be limited or non-existent.

It is worth checking the feature list:

ethtool --show-features ethX | grep ntuple

and trying the filter insertion command suggested above, to establish whether the device's ntuple support is at the required level.

sundbp commented 3 years ago

↳ sudo ethtool -u enp68s0
2 RX rings available
Total 0 rules

And:

↳ sudo ethtool --show-features enp68s0| grep ntuple
ntuple-filters: on

This is more interesting:

↳ sudo ethtool -U enp68s0 flow-type tcp4 dst-ip 10.10.10.11 dst-port 9898 action 1
rmgr: Cannot insert RX class rule: Invalid argument

I don't see anything in syslog.

I found this: https://software.intel.com/content/www/us/en/develop/articles/setting-up-intel-ethernet-flow-director.html

This suggests that perhaps the ethtool output saying ntuple filtering is available and on is misleading?

sundbp commented 3 years ago

A relevant datasheet with a Flow Director section (I can't tell whether it is enough or not): https://cdrdv2.intel.com/v1/dl/getContent/333017