Mellanox / libvma

Linux user space library for network socket acceleration based on RDMA compatible network adaptors
https://www.mellanox.com/products/software/accelerator-software/vma?mtag=vma

Bonding interfaces cannot be offloaded on Linux Kernel 4.9 #313

Open Maokaman1 opened 7 years ago

Maokaman1 commented 7 years ago

Hello!

Something seems to have changed in Linux 4.9 in the way it represents bonded Mellanox interfaces, and this breaks VMA offloading for bonded (teamed) interfaces.

[root@host2 ~]# uname -a
Linux host2 4.9.11-1-ARCH #1 SMP PREEMPT Sun Feb 19 13:45:52 UTC 2017 x86_64 GNU/Linux

[root@host2 ~]# LD_PRELOAD=libvma.so sockperf server
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 8.2.8-0 Development Snapshot built on Feb 27 2017 17:27:29
VMA INFO: Cmd Line: sockperf server
VMA INFO: Current Time: Wed Mar 1 09:59:12 2017
VMA INFO: Pid: 18020
VMA INFO: Architecture: x86_64
VMA INFO: Node: host2
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: ---------------------------------------------------------------------------
VMA WARNING: ****
VMA WARNING: Your current max locked memory is: 65536. Please change it to unlimited.
VMA WARNING: Set this user's default to ulimit -l unlimited.
VMA WARNING: Read more about this topic in the VMA's User Manual.
VMA WARNING: ****
VMA WARNING:
VMA WARNING: Bond bond0 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING:
VMA WARNING: Bond bond0 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING:
VMA WARNING: Bond bond0.10 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING:
VMA WARNING: Bond bond0.8 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING:
VMA WARNING: Bond bond0.8 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING: **
VMA WARNING: NO IMMEDIATE ACTION NEEDED!
VMA WARNING: Not enough hugepage resources for VMA memory allocation.
VMA WARNING: VMA will continue working with regular memory allocation.
VMA INFO: Optional:
VMA INFO: 1. Switch to a different memory allocation type
VMA INFO: (VMA_MEM_ALLOC_TYPE= 0 or 1)
VMA INFO: 2. Restart process after increasing the number of
VMA INFO: hugepages resources in the system:
VMA INFO: "cat /proc/meminfo | grep -i HugePage"
VMA INFO: "echo 1000000000 > /proc/sys/kernel/shmmax"
VMA INFO: "echo 800 > /proc/sys/vm/nr_hugepages"
VMA WARNING: Please refer to the memory allocation section in the VMA's
VMA WARNING: * User Manual for more information
VMA WARNING: ***
sockperf: == version #2.7-54.git4e9e71bf405b ==
sockperf: [SERVER] listen on:
[ 0] IP = 0.0.0.0 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 18020] using recvfrom() to block on socket(s)
^Csockperf: Test end (interrupted by user)
sockperf: No messages were received on the server.
sockperf: cleanupAfterLoop() exit
[root@host2 ~]#
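When a log like the one above gets long, it can help to pull out just the bond names that VMA declined to offload. The following is a minimal Python sketch, not part of VMA: the function name `bonds_not_offloaded` is hypothetical, and the pattern simply matches the warning text as it appears in this log.

```python
import re

# Matches the warning exactly as it is worded in the log above.
_NOT_OFFLOADED = re.compile(r"VMA WARNING: Bond (\S+) will not be offloaded")


def bonds_not_offloaded(log_text: str) -> list[str]:
    """Return the bond names VMA refused to offload.

    Names are de-duplicated and kept in first-seen order. The function
    works on the raw log whether or not line breaks were preserved.
    """
    seen: dict[str, None] = {}
    for match in _NOT_OFFLOADED.finditer(log_text):
        seen.setdefault(match.group(1), None)
    return list(seen)
```

Feeding it the output of the run above would report `bond0`, `bond0.10` and `bond0.8` once each, which makes it clear that the VLAN interfaces on top of the bond are affected as well.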

liranoz12 commented 7 years ago

Hi @Maokaman1 ,

I did not manage to reproduce the issue using kernel 4.9.11, Red Hat 6.4 and VMA 8.2.8. Do you use Mellanox OFED? If yes, please try to reinstall it using the --vma --add-kernel-support parameters. What is the output of the ibstat command? Can you please attach a VMA log with the debug log level (run using VMA_TRACELEVEL=DEBUG)?

Thanks.
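For anyone collecting the requested debug log from a script, here is a rough Python sketch. It assumes only what appears in this thread: the library is preloaded through `LD_PRELOAD` and the log level is set through `VMA_TRACELEVEL=DEBUG`. The helper names `vma_debug_command` and `run_with_debug_log` are hypothetical, and the sketch captures both stdout and stderr because it makes no assumption about which stream VMA writes to.

```python
import os
import subprocess


def vma_debug_command(app_argv, lib="libvma.so", base_env=None):
    """Build (argv, env) for running an application under VMA with debug logging.

    Any LD_PRELOAD already present in the base environment is overwritten.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = lib            # preload VMA, as in the thread
    env["VMA_TRACELEVEL"] = "DEBUG"    # log level requested by the maintainer
    return list(app_argv), env


def run_with_debug_log(app_argv, log_path):
    """Run the application and write everything it prints to log_path."""
    argv, env = vma_debug_command(app_argv)
    with open(log_path, "wb") as log:
        result = subprocess.run(argv, env=env, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

Calling `run_with_debug_log(["sockperf", "server"], "vma_debug.txt")` would then produce a file suitable for attaching to the issue; the file name is only an example.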

Maokaman1 commented 7 years ago

Hello @liranoz12 ,

We use Arch Linux, which is not supported by Mellanox OFED, so we only have these tools and libraries: Arch AUR Infiniband

Unfortunately I have already returned the 2 dual-port MCX416A-CCAT (100Gb, Ethernet only) adapters that I had requested for a test, so I cannot investigate any further at the moment. I've attached a log that I saved back then (mlx5_bond_0 is a pretty suspicious device name).

Now I have only 2 single-port MCX455A-FCAT (56Gb VPI) adapters and I cannot reproduce the problem.

Maokaman1 commented 7 years ago

Hi @liranoz12 ,

Is there any ETA on resolving this dual-port adapter issue?

NirNitzani commented 7 years ago

Hi @Maokaman1 ,

We are not familiar with such an issue when using Mellanox OFED. Have you been able to obtain a new board and test it with Mellanox OFED?

Maokaman1 commented 7 years ago

Hi @NirNitzani , I've got a bunch of new MCX456A-ECAT adapters (dual-port again) and the problem is still there. According to the community post "HowTo Configure RoCE over LAG (ConnectX-4)", the appearance of an aggregated mlx5_bond_0 device instead of two separate ones is typical behaviour when the requirements described in its "Setup" section are met. So it seems that libvma doesn't support the so-called "RoCE LAG mode". Can I somehow disable this mode to make libvma work again?
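A quick way to see whether a host has ended up in this aggregated state is to look at the RDMA device names the kernel exposes under `/sys/class/infiniband`. The Python sketch below is a heuristic only: the `mlx5_bond_<N>` naming pattern is inferred from the single `mlx5_bond_0` name seen in this thread, and the function name `roce_lag_devices` is hypothetical.

```python
import os
import re

# Assumption: aggregated devices are named like the one in this thread (mlx5_bond_0).
_BOND_DEVICE = re.compile(r"^mlx5_bond_\d+$")


def roce_lag_devices(sysfs_dir="/sys/class/infiniband"):
    """Return RDMA device names that look like an aggregated (LAG) device.

    Returns an empty list when the directory does not exist, for example
    when no RDMA driver is loaded.
    """
    try:
        names = os.listdir(sysfs_dir)
    except FileNotFoundError:
        return []
    return sorted(name for name in names if _BOND_DEVICE.match(name))
```

A non-empty result on a host where VMA prints the "will not be offloaded" warning would be consistent with the situation described in this comment, although it does not prove the cause.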

NirNitzani commented 7 years ago

Hi @Maokaman1 ,

VMA does not support RoCE. You can work in ETH mode or IPoIB (supported in the latest OFED). I suggest starting with our latest OFED/VMA release to make sure that everything works, and only then switching to your specific OS.

Maokaman1 commented 6 years ago

Hi @NirNitzani , Unfortunately CentOS 7.4 with Mellanox OFED installed creates that aggregated mlx5_bond_0 (RoCE LAG) device too.

# cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

# uname -a
Linux centos-1.local 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

# modinfo mlx5_ib
filename: /lib/modules/3.10.0-693.2.2.el7.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
version: 4.1-1.0.2
license: Dual BSD/GPL
description: Mellanox Connect-IB HCA IB driver
author: Eli Cohen eli@mellanox.com
rhelversion: 7.4
srcversion: D88500BEA6DD3896298C88C
depends: mlx5_core,ib_core,mlx_compat
vermagic: 3.10.0-693.2.2.el7.x86_64 SMP mod_unload modversions


# /etc/init.d/openibd status

HCA driver loaded

Configured Mellanox EN devices: mlx0 mlx1

Currently active Mellanox devices: mlx0 mlx1

The following OFED modules are loaded:

rdma_ucm rdma_cm ib_ipoib mlx4_core mlx4_ib mlx4_en mlx5_core mlx5_ib ib_uverbs ib_umad ib_ucm ib_cm ib_core


# ibstat
CA 'mlx5_bond_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.20.1010
    Hardware version: 0
    Node GUID: 0x248a070300b1bcd8
    System image GUID: 0x248a070300b1bcd8
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x04010000
        Port GUID: 0x268a07fffeb1bcd8
        Link layer: Ethernet


# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: net0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 70:4d:7b:63:25:c7 brd ff:ff:ff:ff:ff:ff
    inet 192.168.110.145/24 brd 192.168.110.255 scope global net0
       valid_lft forever preferred_lft forever
    inet6 fe80::724d:7bff:fe63:25c7/64 scope link
       valid_lft forever preferred_lft forever
7: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 02:56:fd:62:fd:1d brd ff:ff:ff:ff:ff:ff
8: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 24:8a:07:b1:bc:d8 brd ff:ff:ff:ff:ff:ff
    inet 10.17.17.2/24 brd 10.17.17.255 scope global bond1
       valid_lft forever preferred_lft forever
    inet 10.17.17.20/24 brd 10.17.17.255 scope global secondary bond1
       valid_lft forever preferred_lft forever
    inet6 fe80::268a:7ff:feb1:bcd8/64 scope link
       valid_lft forever preferred_lft forever
9: mlx0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP qlen 1000
    link/ether 24:8a:07:b1:bc:d8 brd ff:ff:ff:ff:ff:ff
10: mlx1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP qlen 1000
    link/ether 24:8a:07:b1:bc:d8 brd ff:ff:ff:ff:ff:ff


[root@centos-1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-mlx0
DEVICE=mlx0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond1
SLAVE=yes
USERCTL=no

[root@centos-1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-mlx1
DEVICE=mlx1
BOOTPROTO=none
ONBOOT=yes
MASTER=bond1
SLAVE=yes
USERCTL=no

[root@centos-1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
BONDING_OPTS="mode=4 miimon=100 fail_over_mac=0"
BOOTPROTO=none
ONBOOT=yes
IPADDR0=10.17.17.2
PREFIX0="24"
IPADDR1=10.17.17.20
PREFIX1="24"
USERCTL=no


libvma 8.3.7 bundled with MLNX_OFED:
[root@centos-1 ~]# LD_PRELOAD=/usr/lib64/libvma.so.8.3.7 sockperf sr
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 8.3.7-0 Release built on Aug 2 2017 03:21:48
VMA INFO: Cmd Line: sockperf sr
VMA INFO: OFED Version: MLNX_OFED_LINUX-4.1-1.0.2.0:
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: ---------------------------------------------------------------------------
VMA WARNING:
VMA WARNING: Bond bond1 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING:
VMA WARNING: Bond bond1 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
sockperf: == version #3.1-16.gitc6a0d0e3ab53 ==
sockperf: [SERVER] listen on:
[ 0] IP = 0.0.0.0 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 5212] using recvfrom() to block on socket(s)


libvma 8.4.4 compiled from git:
[root@centos-1 ~]# LD_PRELOAD=/usr/lib64/libvma.so.8.4.4 sockperf sr
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: VMA_VERSION: 8.4.4-0 Development Snapshot built on Sep 18 2017 14:06:27
VMA INFO: Git: d2c8f241619549dc115cd90865b318f93ad70c46
VMA INFO: Cmd Line: sockperf sr
VMA INFO: Current Time: Mon Sep 18 16:21:48 2017
VMA INFO: Pid: 5384
VMA INFO: OFED Version: MLNX_OFED_LINUX-4.1-1.0.2.0:
VMA INFO: Architecture: x86_64
VMA INFO: Node: centos-1.local
VMA INFO: ---------------------------------------------------------------------------
VMA INFO: Log Level INFO [VMA_TRACELEVEL]
VMA INFO: ---------------------------------------------------------------------------
VMA WARNING:
VMA WARNING: Bond bond1 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING:
VMA WARNING: Bond bond1 will not be offloaded due to problem with it's slaves.
VMA WARNING: Check warning messages for more information.
VMA WARNING:
VMA WARNING: **
VMA WARNING: NO IMMEDIATE ACTION NEEDED!
VMA WARNING: Not enough hugepage resources for VMA memory allocation.
VMA WARNING: VMA will continue working with regular memory allocation.
VMA INFO: Optional:
VMA INFO: 1. Switch to a different memory allocation type
VMA INFO: (VMA_MEM_ALLOC_TYPE!= 2)
VMA INFO: 2. Restart process after increasing the number of
VMA INFO: hugepages resources in the system:
VMA INFO: "echo 1000000000 > /proc/sys/kernel/shmmax"
VMA INFO: "echo 800 > /proc/sys/vm/nr_hugepages"
VMA WARNING: Please refer to the memory allocation section in the VMA's
VMA WARNING: User Manual for more information
VMA WARNING: ***
sockperf: == version #3.1-16.gitc6a0d0e3ab53 ==
sockperf: [SERVER] listen on:
[ 0] IP = 0.0.0.0 PORT = 11111 # UDP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: [tid 5384] using recvfrom() to block on socket(s)

Debug mode on: debug_libvma.so.8.4.4.txt

liranoz12 commented 6 years ago

Hi @Maokaman1,

Thanks for your informative update. This is a known issue when using VMA on CentOS 7.4. Starting with kernel version 3.10.0-693 (the 7.4 kernel), a bond LAG consisting of exactly two ports is not offloaded if both ports belong to a single device. Workaround: when creating a bond LAG, enslave at least two ports that belong to different devices under the bond. A fix for this issue is on our roadmap.
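The condition described here (all slaves of the bond sitting on one device) can be checked from sysfs. The Python sketch below rests on two facts about Linux, that `/sys/class/net/<bond>/bonding/slaves` lists the slave interfaces and that `/sys/class/net/<iface>/device` links to the PCI function, plus one assumption that should be verified on the actual hardware: that the two ports of a dual-port adapter show up as functions of the same PCI device (for example `.0` and `.1`). All function names and the example addresses are hypothetical.

```python
import os


def pci_device_of(pci_function: str) -> str:
    """Drop the PCI function number: '0000:03:00.1' -> '0000:03:00'."""
    return pci_function.rsplit(".", 1)[0]


def bond_spans_multiple_devices(slave_pci: dict[str, str]) -> bool:
    """True if the slaves sit on at least two different PCI devices.

    slave_pci maps interface name -> PCI function address.
    Assumption: ports of one dual-port adapter differ only in the function number.
    """
    return len({pci_device_of(addr) for addr in slave_pci.values()}) >= 2


def read_slave_pci(bond: str, sysfs_net="/sys/class/net") -> dict[str, str]:
    """Collect the PCI function address of every slave of a bond from sysfs."""
    with open(os.path.join(sysfs_net, bond, "bonding", "slaves")) as f:
        slaves = f.read().split()
    return {
        slave: os.path.basename(os.path.realpath(os.path.join(sysfs_net, slave, "device")))
        for slave in slaves
    }
```

On the CentOS host shown earlier in the thread, `bond_spans_multiple_devices(read_slave_pci("bond1"))` would be expected to return `False` if `mlx0` and `mlx1` are the two ports of one adapter, matching the case the maintainer describes.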

Liran.

Maokaman1 commented 6 years ago

Hi @liranoz12,

I've found another workaround that seems to work even on dual-port adapters: you just need to create a "dummy" bridge interface on top of the bond interface (and do not forget to migrate the IP address(es) from the bond to the bridge interface). I'm not sure whether that's a production-ready workaround, but, nevertheless, someone may find this information useful.
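The steps in this comment map onto a handful of standard iproute2 commands. The Python sketch below only builds the command lines and does not execute anything; the bridge name `br0` and the function name `bridge_workaround_commands` are hypothetical, and the bond name and addresses in the usage loop are taken from the `ifcfg-bond1` file shown earlier. The changes are not persistent across reboots, and running the commands requires root.

```python
import shlex


def bridge_workaround_commands(bond, addresses, bridge="br0"):
    """Return iproute2 command lines that put a bridge on top of a bond.

    addresses is a list of CIDR strings currently assigned to the bond;
    each one is removed from the bond and added to the bridge.
    """
    cmds = [
        ["ip", "link", "add", "name", bridge, "type", "bridge"],  # create the bridge
        ["ip", "link", "set", "dev", bond, "master", bridge],     # attach the bond to it
    ]
    for cidr in addresses:
        cmds.append(["ip", "addr", "del", cidr, "dev", bond])     # migrate each address
        cmds.append(["ip", "addr", "add", cidr, "dev", bridge])
    cmds.append(["ip", "link", "set", "dev", bridge, "up"])       # bring the bridge up
    return cmds


# Print the commands for review instead of running them.
for argv in bridge_workaround_commands("bond1", ["10.17.17.2/24", "10.17.17.20/24"]):
    print(shlex.join(argv))
```

Printing rather than executing is deliberate: moving addresses off a live bond interrupts traffic on them, so the sequence is best reviewed and applied from a console session.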

liranoz12 commented 6 years ago

@Maokaman1,

Thanks for your update. We will check this workaround. Liran.

DanielLibenson commented 6 years ago

Hi @Maokaman1, Thank you for the hint. The "dummy" bridge is a good, working workaround for mlx5 devices. You may also use a "dummy" interface as an alternative workaround. We will update our release notes accordingly.

Daniel