NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

nccl-tests result is only half of ib_write_bw #191

Open HeGaoYuan opened 7 months ago

HeGaoYuan commented 7 months ago

I have a bonded RDMA device backed by two physical devices, each with 100 Gb/s of bandwidth. ib_write_bw can reach 185 Gb/s, but nccl-tests only reaches 95 Gb/s, and our monitoring shows that only one of the underlying devices carries traffic. Is this expected?

All my environment information follows. Please feel free to ask me for anything else.

Looking forward to your reply. Thanks!

The ib_write_bw result is the following:

 ib_write_bw -x 3 -q 8 -d mlx5_bond_1 {serverip} --report_gbits --run_infinitely
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_bond_1
 Number of qps   : 8            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API: ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      1765470            185.37              184.94              0.352743

The nccl-tests result is the following (the highest busbw result I have ever seen is 11.xx GB/s):

mpirun --oversubscribe --allow-run-as-root -mca plm_rsh_args "-p 2222 -q -o StrictHostKeyChecking=no" \
    -n 2 -N 1 -H 192.xxx.xxx.111:1,192.xxx.xxx.104:1 \
    -bind-to socket -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_base_help_aggregate 0 \
    -mca btl_tcp_if_include eth0 -mca coll_hcoll_enable 0 -x NCCL_DEBUG=INFO -x NCCL_PXN_DISABLE=1 \
    -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_QPS_PER_CONNECTION=8 -x NCCL_NET_GDR_LEVEL=1 -x NCCL_IB_HCA=mlx5_bond_1 -x NCCL_IB_GID_INDEX=3 \
    ~/nccl-tests-2.10.1/build/all_reduce_perf -b 128M -e 1G -f 2 -g 1 -c 1 -n 100

#                                                        out-of-place               in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   134217728      33554432   float     sum    13970    9.61    9.61  0e+00    13967    9.61    9.61  0e+00
   268435456      67108864   float     sum    27710    9.69    9.69  0e+00    27770    9.67    9.67  0e+00
.....
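As a sanity check on the numbers above (my own arithmetic, not from the logs): converting the nccl-tests busbw to line-rate units shows it fits entirely within a single 100 Gb/s slave link, while the ib_write_bw figure clearly spans both:

```python
# Rough unit conversion (assumption: nccl-tests reports decimal GB/s,
# i.e. 10^9 bytes per second; 1 byte = 8 bits).
def gbytes_to_gbits(gb_per_s: float) -> float:
    return gb_per_s * 8

nccl_busbw = 9.61        # GB/s, from the all_reduce_perf table above
ib_write_bw = 185.37     # Gb/s, from the ib_write_bw output above

print(gbytes_to_gbits(nccl_busbw))   # 76.88 Gb/s -> fits on one 100 Gb/s link
print(ib_write_bw > 100)             # True -> must be using both links
```

So the symptom in the monitoring (traffic on only one slave) is consistent with the measured bandwidth.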

mlx5_bond_1 is a bonded RDMA device with two underlying devices/network interfaces.

cat /proc/net/bonding/bondpcie1 (corresponding to mlx5_bond_1) shows that the bond mode is 802.3ad, the Transmit Hash Policy is layer3+4 (1), and each network interface's bandwidth is 100 Gb/s.
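That hash policy may be the key detail: under 802.3ad with layer3+4 hashing, each flow (identified by its IP/port tuple) is deterministically pinned to one slave, so a single flow can never exceed one link's 100 Gb/s. A minimal sketch of the idea (simplified from the Linux kernel bonding documentation; the IPs and ports below are hypothetical):

```python
# Simplified sketch (assumption) of the Linux bonding layer3+4 transmit
# hash policy: a flow's addresses and ports are hashed to pick one slave,
# so any single flow uses at most one physical link.

def layer34_hash(src_ip: int, dst_ip: int, src_port: int, dst_port: int) -> int:
    # layer3+4 per the bonding docs: (src_port XOR dst_port)
    # XOR ((src_ip XOR dst_ip) AND 0xffff)
    return (src_port ^ dst_port) ^ ((src_ip ^ dst_ip) & 0xFFFF)

def pick_slave(src_ip, dst_ip, src_port, dst_port, n_slaves=2):
    return layer34_hash(src_ip, dst_ip, src_port, dst_port) % n_slaves

# A fixed flow always lands on the same slave (hypothetical 10.0.0.x peers,
# UDP dst port 4791 as used by RoCEv2):
flow = (0x0A000001, 0x0A000002, 49152, 4791)
print(pick_slave(*flow) == pick_slave(*flow))  # True: pinned to one link

# Flows with different source ports can spread across both slaves:
slaves = {pick_slave(0x0A000001, 0x0A000002, p, 4791) for p in range(49152, 49168)}
print(slaves)  # {0, 1}
```

This would explain why ib_write_bw with 8 QPs (8 distinct flows) fills both links while the NCCL traffic stays on one.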

The ibstatus result for mlx5_bond_1 shows a rate of 100 Gb/sec (4X EDR).

The nccl-tests version is 2.10.3, and the NCCL version is 2.14.3.

The device is a ConnectX-6 Dx, the OFED driver is MLNX_OFED_LINUX-5.8-3.0.7.0-rhel7.9-x86_64.iso, and the Linux kernel is 3.10.0.