Xilinx / open-nic-driver

AMD OpenNIC driver includes the Linux kernel driver
GNU General Public License v2.0

Low throughput when using this driver #38

yrpang opened this issue 1 year ago (status: Open)

yrpang commented 1 year ago

I tried to use a 100Gbps QSFP28 DAC cable to connect two U50s and ran an iperf speed test, but the result was only about 26Gbps. Is this expected? What is the maximum speed that the driver can achieve?

cneely-amd commented 1 year ago

Hi @yrpang, The Linux networking stack has some CPU overhead. Measuring performance requires some experimentation and can vary depending on your machines and how many cores you use. It's hard for a single core to saturate a 100G link. To measure performance you might want to try something like:

Run n=8 iperf3 processes, each with a 5 Gb/s target, and sum them to measure a total of e.g. ~32 Gb/s, where each process is launched like the following:

taskset -c 7 iperf3 -c 192.168.20.2 -p 36696 --bind 192.168.20.4 --cport 35986 -t 40 -b 5G > iperf_client_7.log &

Or similarly, with n=16 processes each at 5 Gb/s, you might measure a total of e.g. ~52 Gb/s.

(iperf3 is single threaded)
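
For reference, a minimal sketch of how those n pinned processes might be scripted, assuming an iperf3 server is already listening on each target port (the server IP, duration, and per-process 5G cap follow the command above; the loop structure, the consecutive port numbering starting at 5201, and the log file names are assumptions for illustration):

    # Launch N iperf3 clients, each pinned to its own core with taskset.
    # Assumes an iperf3 server is already listening on ports 5201..(5200+N).
    N=8
    for i in $(seq 0 $((N - 1))); do
        taskset -c "$i" iperf3 -c 192.168.20.2 -p $((5201 + i)) -t 40 -b 5G \
            > "iperf_client_${i}.log" &
    done
    wait
    # Sum the per-process sender bitrates from the logs to get the aggregate throughput.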

These numbers are just based on one setup that I used a while back. With the DPDK driver and pktgen you can more easily reach line rate depending on the capabilities of your machines.

--Chris

yrpang commented 1 year ago

Thank you very much for your reply.

The 26Gbps result was obtained with iperf2 (iperf version 2.0.5 (2 June 2018) pthreads), using iperf -c 192.168.4.2 -P 4 on the client side and iperf -s on the server side.

I've also tried iperf3 as follows:

  1. Start 5 iperf3 servers listening on 5 different ports (a sketch of this step follows the list).
  2. Use the following script to start 5 iperf3 clients, each bound to a CPU core:
    taskset -c 1 iperf3 --client 192.168.4.2 -p 5202 -t 40 -b 20G > iperf_client_1.log &
    taskset -c 2 iperf3 --client 192.168.4.2 -p 5203 -t 40 -b 20G > iperf_client_2.log &
    taskset -c 3 iperf3 --client 192.168.4.2 -p 5204 -t 40 -b 20G > iperf_client_3.log &
    taskset -c 4 iperf3 --client 192.168.4.2 -p 5205 -t 40 -b 20G > iperf_client_4.log &
    taskset -c 5 iperf3 --client 192.168.4.2 -p 5201 -t 40 -b 20G > iperf_client_5.log &
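
For step 1, a minimal sketch assuming the same port range as the client script above (running the servers in the background and the log file names are assumptions):

    # Start one iperf3 server per port in the background; ports match the client script.
    for p in 5201 5202 5203 5204 5205; do
        iperf3 -s -p "$p" > "iperf_server_${p}.log" &
    done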

The results are the following:

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-40.00  sec  22.7 GBytes  4.88 Gbits/sec  5432             sender
[  5]   0.00-40.04  sec  22.7 GBytes  4.87 Gbits/sec                  receiver

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-40.00  sec  23.5 GBytes  5.06 Gbits/sec  4837             sender
[  5]   0.00-40.04  sec  23.5 GBytes  5.05 Gbits/sec                  receiver

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-40.00  sec  24.5 GBytes  5.25 Gbits/sec  7406             sender
[  5]   0.00-40.04  sec  24.4 GBytes  5.24 Gbits/sec                  receiver

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-40.00  sec  34.5 GBytes  7.41 Gbits/sec  11417             sender
[  5]   0.00-40.04  sec  34.5 GBytes  7.40 Gbits/sec                  receiver

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-40.00  sec  32.9 GBytes  7.06 Gbits/sec  7023             sender
[  5]   0.00-40.04  sec  32.9 GBytes  7.05 Gbits/sec                  receiver

These 5 clients add up to about 30Gbps. I also tried running 1, 2, 3, or 4 clients; the total bandwidth was also about 30Gbps. It seems that no matter how many clients are started, the total bandwidth stays around 30Gbps.

Also, to rule out a CMAC link-rate problem, I connected the FPGA to a Mellanox ConnectX-5 and used ethtool to check the negotiated rate on the ConnectX-5 side. It reported a speed of 100Gbps, but the iperf result was still about 26Gbps.
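
For reference, a minimal sketch of that check, assuming a hypothetical interface name enp1s0f0 (substitute the actual interface; whether per-queue counters are exposed depends on the driver):

    ethtool enp1s0f0       # link status; "Speed: 100000Mb/s" means the link negotiated 100G
    ethtool -S enp1s0f0    # driver/per-queue statistics, if the driver implements them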

It seems weird that the total bandwidth doesn't scale with the number of CPU cores. It looks like there is a bottleneck somewhere limiting the total bandwidth, but I don't know where it is.
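
One thing I could check is whether receive work is actually spread across queues and cores during a test. A minimal sketch with generic Linux tools, again assuming the hypothetical interface name enp1s0f0 (channel and IRQ reporting depend on the driver; mpstat comes from the sysstat package):

    ethtool -l enp1s0f0                # configured RX/TX channel (queue) counts, if supported
    grep enp1s0f0 /proc/interrupts     # which CPUs service the NIC's interrupts (IRQ naming varies by driver)
    mpstat -P ALL 1                    # per-core utilization while iperf is running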

Is there anything I'm missing, or are there other tests I should run? If there is any additional information I should provide, please let me know and I will add it. Thank you very much for your help!