How to solve "Modded driver unavailable. Performance will be low."

NIC model in KVM with SR-IOV , one is 100Gb/s, the other one is 25Gb/s

00:0b.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
00:0c.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
00:0e.0 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
00:10.0 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function

rdma_core version and DPDK version

rdma-core-40.4
dpdk-stable-21.11.5

OS/Kernel

CentOS Linux release 8.5.2111/4.18.0-348.7.1.el8_5.x86_64

I ran into two problems, first I chose to use the dpdk side way with

cmake . -DPERF=OFF -DTRANSPORT=dpdk

Compilation went fine, but I have problems running hello_server:

./hello_server: /lib64/libmlx5.so.1: version `MLX5_1.21' not found (required by ./hello_server)
./hello_server: /lib64/libmlx5.so.1: version `MLX5_1.20' not found (required by ./hello_server)

$ ldd ./hello_server
./hello_server: /lib64/libmlx5.so.1: version `MLX5_1.21' not found (required by ./hello_server)
./hello_server: /lib64/libmlx5.so.1: version `MLX5_1.20' not found (required by ./hello_server)
    linux-vdso.so.1 (0x00007ffec39c9000)
    libmlx5.so.1 => /lib64/libmlx5.so.1 (0x00007f37307fa000)
    libibverbs.so.1 => /lib64/libibverbs.so.1 (0x00007f37305da000)
    libmlx4.so.1 => /lib64/libmlx4.so.1 (0x00007f37303cd000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f37301c9000)
    libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f372ffbd000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f372fda6000)
    libelf.so.1 => /lib64/libelf.so.1 (0x00007f372fb8d000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f372f96d000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f372f5ec000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f372f257000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f372f03f000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f372ec79000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f3730a4d000)
    libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x00007f372e9f3000)
    libnl-3.so.200 => /lib64/libnl-3.so.200 (0x00007f372e7d0000)

$ ll /lib64/libmlx5.so.1
lrwxrwxrwx. 1 root root 20 May 17  2021 /lib64/libmlx5.so.1 -> libmlx5.so.1.19.35.0

Then when I use the Roce pattern with

cmake . -DPERF=off -DTRANSPORT=infiniband -DROCE=on

In this way, hello_client & hello_server can be run normally. But get warning:

Modded driver unavailable. Performance will be low.

My goal is to hit the full 100Gb bandwidth in the production environment. So

Which mode is more recommended, DPDK (UDP) or RoCE (RDMA) in the actual production environment;
How to resolve /lib64/libmlx5.so.1: versionMLX5_1.21' not found`;
How to resolve Modded driver unavailable. Performance will be low to get the best performance;

@anujkaliaiitd Looking forward to your reply, thank you very much!

Hi @ethercflow

I would recommend using the DPDK backend, since that's currently more stable and well-tested
The MLX5 symbol/version issue could be related to multiple versions of libmlx5 on your system. Please check that distro packages like ibverbs-providers etc that install libmlx5 haven't been installed.
The "Modded driver unavailable" issue can be ignored. I used to maintain a modified version of mlx5 drivers that optimize message rate (by ~10%) for small messages (< 128 B), but for most applications these optimizations aren't important.

Both the RoCE and DPDK implementations can get good performance, though we need ~2 cores to reach 100 Gbps. For high bandwidth, increasing the MTU in eRPC's source will improve performance significantly, e.g., https://github.com/erpc-io/eRPC/blob/094c17c3cd9b48bcfbed63f455cc85b9976bd43f/src/transport_impl/dpdk/dpdk_transport.h#L191, and https://github.com/erpc-io/eRPC/blob/094c17c3cd9b48bcfbed63f455cc85b9976bd43f/src/transport_impl/dpdk/dpdk_transport.h#L210. For the latter line, we can use kMTU + 1024 instead of the hard-coded 2048.

Thanks for your reply, it's very helpful to me. :)

Both the RoCE and DPDK implementations can get good performance, though we need ~2 cores to reach 100 Gbps. For high bandwidth, increasing the MTU in eRPC's source will improve performance significantly, e.g.,

https://github.com/erpc-io/eRPC/blob/094c17c3cd9b48bcfbed63f455cc85b9976bd43f/src/transport_impl/dpdk/dpdk_transport.h#L191

, and https://github.com/erpc-io/eRPC/blob/094c17c3cd9b48bcfbed63f455cc85b9976bd43f/src/transport_impl/dpdk/dpdk_transport.h#L210

. For the latter line, we can use kMTU + 1024 instead of the hard-coded 2048.

Hi @anujkaliaiitd , sorry to bother you. I have another question: our IDC's nic and switch's MTU size is 9000. So what is the relationship between the value of kMTU and 9000? What is the maximum value that kMTU can be set?

Hi. You could try:

kMTU = 8192
kMbufSize = static_cast(sizeof(struct rte_mbuf)) + RTE_PKTMBUF_HEADROOM + 8192;

You could tweak the 8192 value a bit, e.g., 8900 could work. I don't recall the exact packet structure or padding needed, but 1000/100 spare bytes should be more than enough.

Hi. You could try:

kMTU = 8192

kMbufSize = static_cast(sizeof(struct rte_mbuf)) + RTE_PKTMBUF_HEADROOM + 8192;

You could tweak the 8192 value a bit, e.g., 8900 could work. I don't recall the exact packet structure or padding needed, but 1000/100 spare bytes should be more than enough.

@ankalia Thanks a lot, I use 8192 with large_rpc_tput to bench throughput with parameters:

--test-ms 3000000
--req-size 1048576
--resp-size 32
--num-processes 2
--num-proc-0-threads 1
--num-proc-other-threads 1
--concurrency 1
--drop-prob 0.0
--profile incast
--throttle 0
--throttle-fraction 0.9
--numa-0-ports 2
--numa-1-ports 3

Got

large_rpc_tput: Thread 0: Tput {RX 0.00 (4857), TX 40.75 (4857)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {196.1 50th, 389.0 99th, 697.6 99.9th}. Timely rate 71.8 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4758), TX 39.92 (4758)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {196.8 50th, 451.8 99th, 783.1 99.9th}. Timely rate 16.0 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4925), TX 41.32 (4925)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {195.2 50th, 346.3 99th, 750.1 99.9th}. Timely rate 53.1 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4871), TX 40.86 (4871)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {196.4 50th, 379.5 99th, 757.2 99.9th}. Timely rate 34.3 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4934), TX 41.39 (4934)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {192.1 50th, 417.9 99th, 791.3 99.9th}. Timely rate 87.3 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (5007), TX 42.01 (5007)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {192.1 50th, 340.1 99th, 710.7 99.9th}. Timely rate 90.7 Gbps. Credits 32 (best = 32).

If I increase --concurrency to 2, Retransmissions will happen:

large_rpc_tput: Thread 0: Tput {RX 0.00 (4927), TX 41.33 (4927)} Gbps (IOPS). Retransmissions 199. Packet RTTs: {-1.0, -1.0} us. RPC latency {193.0 50th, 402.9 99th, 660.2 99.9th}. Timely rate 78.2 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4438), TX 37.23 (4438)} Gbps (IOPS). Retransmissions 198. Packet RTTs: {-1.0, -1.0} us. RPC latency {203.8 50th, 569.4 99th, 894.3 99.9th}. Timely rate 42.6 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4510), TX 37.84 (4510)} Gbps (IOPS). Retransmissions 199. Packet RTTs: {-1.0, -1.0} us. RPC latency {198.0 50th, 583.6 99th, 892.4 99.9th}. Timely rate 41.9 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4519), TX 37.91 (4519)} Gbps (IOPS). Retransmissions 199. Packet RTTs: {-1.0, -1.0} us. RPC latency {202.0 50th, 531.9 99th, 881.6 99.9th}. Timely rate 59.0 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4639), TX 38.92 (4639)} Gbps (IOPS). Retransmissions 199. Packet RTTs: {-1.0, -1.0} us. RPC latency {196.4 50th, 498.1 99th, 793.3 99.9th}. Timely rate 61.7 Gbps. Credits 32 (best = 32).

Increase --concurrency to 16, retransmission will be more serious.

large_rpc_tput: Thread 0: Tput {RX 0.00 (4503), TX 37.78 (4503)} Gbps (IOPS). Retransmissions 1391. Packet RTTs: {-1.0, -1.0} us. RPC latency {1931.0 50th, 3071.7 99th, 3455.7 99.9th}. Timely rate 81.8 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4414), TX 37.03 (4414)} Gbps (IOPS). Retransmissions 1389. Packet RTTs: {-1.0, -1.0} us. RPC latency {1948.4 50th, 3182.2 99th, 3711.9 99.9th}. Timely rate 35.6 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4473), TX 37.53 (4473)} Gbps (IOPS). Retransmissions 1390. Packet RTTs: {-1.0, -1.0} us. RPC latency {1943.4 50th, 3070.6 99th, 3344.7 99.9th}. Timely rate 48.8 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4570), TX 38.34 (4570)} Gbps (IOPS). Retransmissions 1388. Packet RTTs: {-1.0, -1.0} us. RPC latency {1902.9 50th, 3003.2 99th, 4543.0 99.9th}. Timely rate 80.3 Gbps. Credits 32 (best = 32).

Judging from the results, it is not as good as this friend’s test result. I would like to ask what suggestions you have for this? Did I miss something?

Besides that, I want to create more sessions on one rpc, and I have two questions about this:

Is it possible to create more sessions by increasing kNumRxRingEntries, and at what cost?
The code comments mentioned don't create sessions in the data path, so if I create multiple sessions in advance, but only a few of them are currently in use, the unused (will be used in the future) sessions will not occupy the rpc bandwidth, right?

Thanks a lot for your help! cc @anujkaliaiitd

Hi, thanks for sharing the numbers. It's great to know that the code works through KVM and on CentOS :).

To tune single-flow tput, you can allow more credits per flow. That's a static limit on the number of outstanding packets per flow, which can limit perf. https://github.com/erpc-io/eRPC/blob/094c17c3cd9b48bcfbed63f455cc85b9976bd43f/src/sm_types.h#L11

You can also try disabling congestion control by setting this to false: https://github.com/erpc-io/eRPC/blob/094c17c3cd9b48bcfbed63f455cc85b9976bd43f/src/tweakme.h#L16

I don't know why you're seeing packet loss, which shouldn't happen with a single flow. You could try building eRPC with DLOG_LEVEL=trace and sift through the timeout logs output in /tmp.

The post you've linked to seems to use bare-metal NICs, whereas you're using KVM. This can effect performance I assume.

You can create lots of sessions with one Rpc object without changing kNumRxRingEntries. If create_session starts to fail, you can try bumping the RX ring entries.

Unused sessions don't affect performance.

erpc-io / eRPC

How to solve "Modded driver unavailable. Performance will be low." #100