NVIDIA / nccl-tests

NCCL Tests

nccl-tests result is only a half of ib_write_bw #191

Open HeGaoYuan opened 11 months ago

HeGaoYuan commented 11 months ago

I have a bonded RDMA device built from two underlying devices, each with 100 Gb/s of bandwidth. ib_write_bw can reach 185 Gb/s, but nccl-tests only reaches about 95 Gb/s, and our monitoring shows that only one of the underlying devices carries traffic. Is this expected?

My environment information follows. Please feel free to ask for any other information.

Looking forward to your reply. Thanks!

The ib_write_bw result is as follows:

 ib_write_bw -x 3 -q 8 -d mlx5_bond_1 {serverip} --report_gbits --run_infinitely
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_bond_1
 Number of qps   : 8            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API: ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      1765470            185.37              184.94              0.352743

The nccl-tests result is as follows (the highest busbw we have ever seen is 11.xx GB/s):

mpirun --oversubscribe --allow-run-as-root -mca plm_rsh_args "-p 2222 -q -o StrictHostKeyChecking=no" \
    -n 2 -N 1 -H 192.xxx.xxx.111:1,192.xxx.xxx.104:1 \
    -bind-to socket -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_base_help_aggregate 0 \
    -mca btl_tcp_if_include eth0 -mca coll_hcoll_enable 0 -x NCCL_DEBUG=INFO -x NCCL_PXN_DISABLE=1 \
    -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_QPS_PER_CONNECTION=8 -x NCCL_NET_GDR_LEVEL=1 -x NCCL_IB_HCA=mlx5_bond_1 -x NCCL_IB_GID_INDEX=3 \
    ~/nccl-tests-2.10.1/build/all_reduce_perf -b 128M -e 1G -f 2 -g 1 -c 1 -n 100

#       size       count   type  redop   time   algbw   busbw  error   time   algbw   busbw  error
#        (B)  (elements)                 (us)  (GB/s)  (GB/s)          (us)  (GB/s)  (GB/s)
#  134217728    33554432  float    sum  13970    9.61    9.61  0e+00  13967    9.61    9.61  0e+00
#  268435456    67108864  float    sum  27710    9.69    9.69  0e+00  27770    9.67    9.67  0e+00
.....
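
When comparing the two results, note that nccl-tests reports GB/s while ib_write_bw (with --report_gbits) reports Gb/s. Below is a minimal sketch of the conversion. It assumes the nccl-tests convention that all-reduce busbw = algbw * 2*(n-1)/n, which gives a factor of 1 for the two ranks used here, so busbw equals algbw in the table. The function names gbytes_to_gbits and allreduce_busbw are illustrative, not part of nccl-tests.

```shell
#!/bin/sh
# Convert a nccl-tests bandwidth figure (GB/s) to Gb/s.
gbytes_to_gbits() {
    awk -v v="$1" 'BEGIN { printf "%.2f\n", v * 8 }'
}

# All-reduce bus bandwidth from algorithm bandwidth and rank count,
# using the nccl-tests convention busbw = algbw * 2*(n-1)/n.
allreduce_busbw() {
    awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

gbytes_to_gbits 9.61      # busbw from the 134217728-byte row above
allreduce_busbw 9.61 2    # two ranks: the factor is 1, so busbw == algbw
```

So 9.61 GB/s is roughly 77 Gb/s, and the best case of 11.xx GB/s falls between 88 and 96 Gb/s, which is about the capacity of a single 100 Gb/s link rather than the 185 Gb/s measured by ib_write_bw.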

mlx5_bond_1 is a bonded RDMA network device made up of two devices/network interfaces.

cat /proc/net/bonding/bondpcie1 (the bond corresponding to mlx5_bond_1) shows that the bond mode is 802.3ad with Transmit Hash Policy: layer3+4 (1), and that each network interface has 100 Gb/s of bandwidth.

The ibstatus output for mlx5_bond_1 shows a rate of 100 Gb/sec (4X EDR).

The nccl-tests version is 2.10.3 and the NCCL version is 2.14.3.

The device is a ConnectX-6 Dx, the OFED driver is MLNX_OFED_LINUX-5.8-3.0.7.0-rhel7.9-x86_64.iso, and the Linux kernel is 3.10.0.

renwuli commented 1 week ago

Do you have any updates on this issue? Have you fixed it?

913871734 commented 1 week ago

How did you resolve this problem? I have run into the same one.

AddyLaddy commented 1 week ago

Using bonded devices is not recommended with NCCL. NCCL can most efficiently drive both cards if they are not bonded and presented individually to it.
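
As a rough illustration of presenting the two NICs individually, the sketch below builds the NCCL-related mpirun arguments with both devices listed in NCCL_IB_HCA, which accepts a comma-separated list. The device names mlx5_0 and mlx5_1 are hypothetical placeholders for the un-bonded ports (ibv_devices lists the real names), the helper nccl_ib_args is illustrative, and the GID index may differ once the bond is removed.

```shell
#!/bin/sh
# Build the "-x VAR=value" mpirun arguments that select RDMA devices for NCCL.
# $1: comma-separated device list, $2: GID index.
nccl_ib_args() {
    printf '%s\n' "-x NCCL_IB_HCA=$1 -x NCCL_IB_GID_INDEX=$2"
}

# Hypothetical un-bonded device names; replace with the ones on your hosts.
nccl_ib_args "mlx5_0,mlx5_1" 3
```

The printed arguments would take the place of the corresponding -x flags in the mpirun command from the original post. Whether NCCL then drives both NICs for a single GPU depends on the topology it detects, so this is a starting point rather than a guaranteed fix.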

renwuli commented 5 days ago

> Using bonded devices is not recommended with NCCL. NCCL can most efficiently drive both cards if they are not bonded and presented individually to it.

Hi @AddyLaddy, can you explain more? Why does RDMA performance on the bonded device look fine with ib_write_bw, while NCCL achieves only half of that bandwidth?

913871734 commented 5 days ago

Hi @AddyLaddy. I have read the NCCL source code (after commit 2.20.3), which adds support for port fusion in NET/IB. So I'm confused: why is a bonded port not well supported, while two separate ports are? And doesn't the message of the commit that adds port fusion mean that bonded ports are supported?

AddyLaddy commented 5 days ago

NCCL uses advanced PCI-E topology detection to determine which NICs are close to each GPU and then will drive all available NICs in parallel to achieve the peak BW of that system. Bonding may hide away the real physical location of the NICs and also probably doesn't report the aggregate speed of the 2 NICs to NCCL so it can't determine how much resource should be dedicated to driving that bonded NIC. I believe Port Fusion will be the preferred method going forward.
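
One way to see the speed-reporting point is to compare the rate the RDMA stack reports for the bonded device with the sum of its member links. The sketch below assumes the usual sysfs layout /sys/class/infiniband/<device>/ports/<port>/rate and a rate string like the "100 Gb/sec (4X EDR)" reported by ibstatus earlier in this thread. The helpers rate_gbps and link_shortfall are illustrative names, and whether NCCL derives its speed estimate from exactly this value is an assumption, not something confirmed here.

```shell
#!/bin/sh
# Extract the numeric Gb/s value from a rate string such as "100 Gb/sec (4X EDR)".
rate_gbps() {
    printf '%s\n' "$1" | awk '{ print $1 + 0 }'
}

# Difference between the expected aggregate (members * per-link Gb/s)
# and the rate actually reported for the bonded device.
# $1: rate string, $2: number of member links, $3: per-link rate in Gb/s.
link_shortfall() {
    _r=$(rate_gbps "$1")
    awk -v r="$_r" -v m="$2" -v p="$3" 'BEGIN { print m * p - r }'
}

# On a real host the string would come from sysfs (path assumed), e.g.:
#   cat /sys/class/infiniband/mlx5_bond_1/ports/1/rate
reported="100 Gb/sec (4X EDR)"   # value taken from the ibstatus output in this thread
echo "reported: $(rate_gbps "$reported") Gb/s"
echo "shortfall vs 2 x 100 Gb/s: $(link_shortfall "$reported" 2 100) Gb/s"
```

A non-zero shortfall means the bonded device advertises less than the combined capacity of its members, which is consistent with the explanation above.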

913871734 commented 4 days ago

You mean that NCCL treats the bonded port as a single port and cannot identify its two member ports, so it cannot allocate resources well for them, right? I still don't completely understand: does a fused port refer to a bonded NIC? If so, and the current NCCL version supports fused NICs, doesn't the current NCCL version already support bonding? Looking forward to your reply and explanation.

> NCCL uses advanced PCI-E topology detection to determine which NICs are close to each GPU and then will drive all available NICs in parallel to achieve the peak BW of that system. Bonding may hide away the real physical location of the NICs and also probably doesn't report the aggregate speed of the 2 NICs to NCCL so it can't determine how much resource should be dedicated to driving that bonded NIC. I believe Port Fusion will be the preferred method going forward.

913871734 commented 1 day ago

> NCCL uses advanced PCI-E topology detection to determine which NICs are close to each GPU and then will drive all available NICs in parallel to achieve the peak BW of that system. Bonding may hide away the real physical location of the NICs and also probably doesn't report the aggregate speed of the 2 NICs to NCCL so it can't determine how much resource should be dedicated to driving that bonded NIC. I believe Port Fusion will be the preferred method going forward.

How can I configure the bond mode so that nccl can maximize its performance?