Mellanox / libvma

Linux user space library for network socket acceleration based on RDMA compatible network adaptors

Bonding interfaces cannot be offloaded on Linux Kernel 4.9 #313

Open Maokaman1 opened 7 years ago

Maokaman1 commented 7 years ago


It seems that something changed in Linux 4.9 in the way it represents bonded Mellanox interfaces, and this breaks VMA offloading for teamed interfaces.

[root@host2 ~]# uname -a
Linux host2 4.9.11-1-ARCH #1 SMP PREEMPT Sun Feb 19 13:45:52 UTC 2017 x86_64 GNU/Linux

[root@host2 ~]# sockperf server
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 8.2.8-0 Development Snapshot built on Feb 27 2017 17:27:29
VMA INFO: Cmd Line: sockperf server
VMA INFO: Current Time: Wed Mar 1 09:59:12 2017
VMA INFO: Pid: 18020
VMA INFO: Architecture: x86_64
VMA INFO: Node: host2
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: ---------------------------------------------------------------------------
VMA WARNING: ****
VMA WARNING: Your current max locked memory is: 65536. Please change it to unlimited.
VMA WARNING: Set this user's default to ulimit -l unlimited.
VMA WARNING: Read more about this topic in the VMA's User Manual.
VMA WARNING: ****
VMA WARNING:
VMA WARNING: Bond bond0 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING:
VMA WARNING: Bond bond0 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING:
VMA WARNING: Bond bond0.10 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING:
VMA WARNING: Bond bond0.8 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING:
VMA WARNING: Bond bond0.8 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING: **
VMA WARNING: NO IMMEDIATE ACTION NEEDED!
VMA WARNING: Not enough hugepage resources for VMA memory allocation.
VMA WARNING: VMA will continue working with regular memory allocation.
VMA INFO: Optional:
VMA INFO: 1. Switch to a different memory allocation type
VMA INFO: (VMA_MEM_ALLOC_TYPE= 0 or 1)
VMA INFO: 2. Restart process after increasing the number of
VMA INFO: hugepages resources in the system:
VMA INFO: "cat /proc/meminfo | grep -i HugePage"
VMA INFO: "echo 1000000000 > /proc/sys/kernel/shmmax"
VMA INFO: "echo 800 > /proc/sys/vm/nr_hugepages"
VMA WARNING: Please refer to the memory allocation section in the VMA's
VMA WARNING: * User Manual for more information
VMA WARNING: ***
sockperf: == version #2.7-54.git4e9e71bf405b ==
sockperf: [SERVER] listen on:
[ 0] IP = PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 18020] using recvfrom() to block on socket(s)
^Csockperf: Test end (interrupted by user)
sockperf: No messages were received on the server.
sockperf: cleanupAfterLoop() exit
[root@host2 ~]#

liranoz12 commented 7 years ago

Hi @Maokaman1 ,

I did not manage to reproduce the issue using kernel 4.9.11, Red Hat 6.4 and VMA 8.2.8. Do you use Mellanox OFED? If yes, please try reinstalling it with the --vma --add-kernel-support parameters. What is the output of the ibstat command? Can you please attach a VMA log at debug log level (run with VMA_TRACELEVEL=DEBUG)?
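For anyone collecting the requested debug log, a minimal sketch follows. VMA_TRACELEVEL=DEBUG comes from this thread; the helper name run_vma_debug, the library path and the log file name are placeholders (assumptions), and the sketch simply captures both stdout and stderr of the program into one file.

```shell
# Run a command under VMA with debug-level logging and save everything the
# process prints (stdout and stderr) to a log file.
# Usage: run_vma_debug <path-to-libvma.so> <logfile> <command> [args...]
# NOTE: run_vma_debug is a hypothetical helper, not part of libvma.
run_vma_debug() (
    lib="$1"
    log="$2"
    shift 2
    # The environment assignments apply only to the command being started.
    VMA_TRACELEVEL=DEBUG LD_PRELOAD="$lib" "$@" > "$log" 2>&1
)

# Example (paths are placeholders, adjust to your installation):
#   run_vma_debug /path/to/libvma.so vma_debug.log sockperf server
```

The function body runs in a subshell, so the helper variables do not leak into the calling shell.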


Maokaman1 commented 7 years ago

Hello @liranoz12 ,

We use Arch Linux, which is not supported by Mellanox OFED, so we have only these tools and libs: Arch AUR Infiniband

Unfortunately I have already returned the 2 dual-port MCX416A-CCAT (100Gb, Ethernet only) adapters that I had requested for a test, so I cannot investigate any further at the moment. I've attached a log that I saved back then (mlx5_bond_0 is a pretty suspicious device name).

Now I have only 2 single-port MCX455A-FCAT (56Gb VPI) adapters and I cannot reproduce the problem.

Maokaman1 commented 7 years ago

Hi @liranoz12 ,

Is there any ETA on resolving this dual-port adapter issue?

NirNitzani commented 7 years ago

Hi @Maokaman1 ,

We are not familiar with such an issue when using Mellanox OFED. Have you been able to obtain a new board and test it with Mellanox OFED?

Maokaman1 commented 7 years ago

Hi @NirNitzani , I've got a bunch of new MCX456A-ECAT adapters (dual-port again) and the problem is still there. According to the community post "HowTo Configure RoCE over LAG (ConnectX-4)", the appearance of an aggregated mlx5_bond_0 device instead of two separate ones is typical behaviour if you meet the requirements described in its "Setup" section. So it seems that libvma doesn't support the so-called "RoCE LAG mode". Can I somehow disable this mode to make libvma work again?
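A quick way to see whether a host is in this state is to look for an aggregated RDMA device name. The sketch below lists entries under /sys/class/infiniband whose names contain "_bond_"; that pattern is an assumption based only on the mlx5_bond_0 name reported in this thread, and list_bond_rdma_devices is a hypothetical helper.

```shell
# List RDMA devices whose names look like an aggregated RoCE LAG device
# (e.g. mlx5_bond_0). The "*_bond_*" pattern is an assumption.
# Usage: list_bond_rdma_devices [sysfs-infiniband-dir]
list_bond_rdma_devices() (
    dir="${1:-/sys/class/infiniband}"
    for dev in "$dir"/*_bond_*; do
        [ -e "$dev" ] || continue   # the glob matched nothing
        basename "$dev"
    done
)

# Example:
#   list_bond_rdma_devices
```

Empty output means no such device name was found; it does not prove that offloading will work.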

NirNitzani commented 7 years ago

Hi @Maokaman1 ,

VMA does not support that; it can work in ETH mode or IPoIB (supported in the latest OFED). I suggest starting with our latest OFED/VMA release to ensure that everything is working, and only then switching to your specific OS.

Maokaman1 commented 6 years ago

Hi @NirNitzani , Unfortunately CentOS 7.4 with Mellanox OFED installed creates that aggregated mlx5_bond_0 (RoCE LAG) device too.

# cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

# uname -a
Linux centos-1.local 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

# modinfo mlx5_ib
filename: /lib/modules/3.10.0-693.2.2.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
version: 4.1-1.0.2
license: Dual BSD/GPL
description: Mellanox Connect-IB HCA IB driver
author: Eli Cohen
rhelversion: 7.4
srcversion: D88500BEA6DD3896298C88C
depends: mlx5_core,ib_core,mlx_compat
vermagic: 3.10.0-693.2.2.el7.x86_64 SMP mod_unload modversions

# /etc/init.d/openibd status

HCA driver loaded

Configured Mellanox EN devices: mlx0 mlx1

Currently active Mellanox devices: mlx0 mlx1

The following OFED modules are loaded:

rdma_ucm rdma_cm ib_ipoib mlx4_core mlx4_ib mlx4_en mlx5_core mlx5_ib ib_uverbs ib_umad ib_ucm ib_cm ib_core

# ibstat
CA 'mlx5_bond_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.20.1010
    Hardware version: 0
    Node GUID: 0x248a070300b1bcd8
    System image GUID: 0x248a070300b1bcd8
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x04010000
        Port GUID: 0x268a07fffeb1bcd8
        Link layer: Ethernet

# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: net0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 70:4d:7b:63:25:c7 brd ff:ff:ff:ff:ff:ff
    inet brd scope global net0
       valid_lft forever preferred_lft forever
    inet6 fe80::724d:7bff:fe63:25c7/64 scope link
       valid_lft forever preferred_lft forever
7: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 02:56:fd:62:fd:1d brd ff:ff:ff:ff:ff:ff
8: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 24:8a:07:b1:bc:d8 brd ff:ff:ff:ff:ff:ff
    inet brd scope global bond1
       valid_lft forever preferred_lft forever
    inet brd scope global secondary bond1
       valid_lft forever preferred_lft forever
    inet6 fe80::268a:7ff:feb1:bcd8/64 scope link
       valid_lft forever preferred_lft forever
9: mlx0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP qlen 1000
    link/ether 24:8a:07:b1:bc:d8 brd ff:ff:ff:ff:ff:ff
10: mlx1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP qlen 1000
    link/ether 24:8a:07:b1:bc:d8 brd ff:ff:ff:ff:ff:ff

[root@centos-1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-mlx0
DEVICE=mlx0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond1
SLAVE=yes
USERCTL=no

[root@centos-1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-mlx1
DEVICE=mlx1
BOOTPROTO=none
ONBOOT=yes
MASTER=bond1
SLAVE=yes
USERCTL=no

[root@centos-1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
BONDING_OPTS="mode=4 miimon=100 fail_over_mac=0"
BOOTPROTO=none
ONBOOT=yes
IPADDR0=
PREFIX0="24"
IPADDR1=
PREFIX1="24"
USERCTL=no

libvma 8.3.7 bundled with MLNX_OFED:

[root@centos-1 ~]# LD_PRELOAD=/usr/lib64/ sockperf sr
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 8.3.7-0 Release built on Aug 2 2017 03:21:48
VMA INFO: Cmd Line: sockperf sr
VMA INFO: OFED Version: MLNX_OFED_LINUX-4.1-
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: ---------------------------------------------------------------------------
VMA WARNING:
VMA WARNING: Bond bond1 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING:
VMA WARNING: Bond bond1 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
sockperf: == version #3.1-16.gitc6a0d0e3ab53 ==
sockperf: [SERVER] listen on:
[ 0] IP = PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 5212] using recvfrom() to block on socket(s)

libvma 8.4.4 compiled from git:

[root@centos-1 ~]# LD_PRELOAD=/usr/lib64/ sockperf sr
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 8.4.4-0 Development Snapshot built on Sep 18 2017 14:06:27
VMA INFO: Git: d2c8f241619549dc115cd90865b318f93ad70c46
VMA INFO: Cmd Line: sockperf sr
VMA INFO: Current Time: Mon Sep 18 16:21:48 2017
VMA INFO: Pid: 5384
VMA INFO: OFED Version: MLNX_OFED_LINUX-4.1-
VMA INFO: Architecture: x86_64
VMA INFO: Node: centos-1.local
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: ---------------------------------------------------------------------------
VMA WARNING:
VMA WARNING: Bond bond1 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING:
VMA WARNING: Bond bond1 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING: **
VMA WARNING: NO IMMEDIATE ACTION NEEDED!
VMA WARNING: Not enough hugepage resources for VMA memory allocation.
VMA WARNING: VMA will continue working with regular memory allocation.
VMA INFO: Optional:
VMA INFO: 1. Switch to a different memory allocation type
VMA INFO: (VMA_MEM_ALLOC_TYPE!= 2)
VMA INFO: 2. Restart process after increasing the number of
VMA INFO: hugepages resources in the system:
VMA INFO: "echo 1000000000 > /proc/sys/kernel/shmmax"
VMA INFO: "echo 800 > /proc/sys/vm/nr_hugepages"
VMA WARNING: Please refer to the memory allocation section in the VMA's
VMA WARNING: User Manual for more information
VMA WARNING: ***
sockperf: == version #3.1-16.gitc6a0d0e3ab53 ==
sockperf: [SERVER] listen on:
[ 0] IP = PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 5384] using recvfrom() to block on socket(s)

Debug mode on:

liranoz12 commented 6 years ago

Hi @Maokaman1,

Thanks for your informative update. This is a known issue when using VMA with CentOS 7.4. Starting with kernel version 3.10.0-693 (the 7.4 kernel), a bond LAG consisting of exactly two ports will not be offloaded if both ports belong to a single device. Workaround: when creating a bond LAG, enslave at least two ports belonging to different devices under the bond. A fix for this issue is on our roadmap.
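To check whether a given bond falls into the case described above, one can compare where its slaves sit on the PCI bus. The sketch below is only an approximation: it assumes "a single device" means the same PCI bus:device with different function numbers (as on a dual-port adapter), reads the standard bonding sysfs file, and uses a hypothetical helper name.

```shell
# Print "yes" if every slave of <bond> resolves to the same PCI bus:device
# (ignoring the function number), "no" otherwise.
# Usage: bond_slaves_share_device <bond> [sysfs-net-dir]
# ASSUMPTION: "single device" == same PCI bus:device; not confirmed by VMA.
bond_slaves_share_device() (
    bond="$1"
    net="${2:-/sys/class/net}"
    first=""
    for slave in $(cat "$net/$bond/bonding/slaves"); do
        # <if>/device is a symlink to the PCI function, e.g. 0000:03:00.1;
        # strip the trailing ".1" to keep only the bus:device part.
        pci="$(basename "$(readlink -f "$net/$slave/device")")"
        pci="${pci%.*}"
        if [ -z "$first" ]; then
            first="$pci"
        elif [ "$pci" != "$first" ]; then
            echo no
            return 0
        fi
    done
    echo yes
)

# Example:
#   bond_slaves_share_device bond1
```

A "yes" for a two-port bond would match the configuration reported as not offloaded in this thread.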


Maokaman1 commented 6 years ago

Hi @liranoz12,

I've found another workaround that seems to work even on dual-port adapters: create a "dummy" bridge interface on top of the bond interface (and do not forget to migrate the IP address(es) from the bond to the bridge interface). I'm not sure whether that's a production-ready workaround, but nevertheless someone may find this information useful.
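The bridge workaround could look roughly like the following with iproute2. This is a sketch, not a tested recipe: the bridge name br0 and the 192.0.2.10/24 address are placeholders, bridge_workaround_cmds is a hypothetical helper, and it only prints the commands so they can be reviewed before being run as root.

```shell
# Print (do not run) iproute2 commands that put a bridge on top of a bond
# and move the given addresses from the bond to the bridge.
# Usage: bridge_workaround_cmds <bond> <bridge> <addr/prefix> [...]
bridge_workaround_cmds() (
    bond="$1"
    bridge="$2"
    shift 2
    echo "ip link add name $bridge type bridge"
    echo "ip link set dev $bond master $bridge"
    for addr in "$@"; do
        echo "ip addr del $addr dev $bond"
        echo "ip addr add $addr dev $bridge"
    done
    echo "ip link set dev $bridge up"
)

# Example (placeholder names and address):
#   bridge_workaround_cmds bond1 br0 192.0.2.10/24
```

Moving the addresses briefly interrupts connectivity, and the change does not survive a reboot unless it is also written to the distribution's network configuration.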

liranoz12 commented 6 years ago


Thanks for your update. We will check this workaround. Liran.

DanielLibenson commented 6 years ago

Hi @Maokaman1, Thank you for the hint; a "dummy" bridge is a good, working workaround for mlx5 devices. You may also use a "dummy" interface as an alternative workaround. We will update our release notes accordingly.
