NVIDIA / nccl-tests

NCCL Tests

The network bandwidth in the alltoall_perf test failed to meet expectations #209

Open fj1425fj opened 3 months ago

fj1425fj commented 3 months ago

With the ib_write_bw tool, the RoCE bond network can reach 180+ Gb/s per bonded NIC (mlx5_bond_x). When I run alltoall across four servers, the test results are as expected, but with three servers the bandwidth is only about half of what I expect.
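For reference, the baseline was measured with ib_write_bw between a pair of the servers, roughly along these lines (a representative invocation rather than the exact command; the GID index matches the NCCL_IB_GID_INDEX=3 used below):

# server side
ib_write_bw -d mlx5_bond_0 -x 3 -F --report_gbits
# client side (<server_bond0_ip> is a placeholder for the peer's bond0 address)
ib_write_bw -d mlx5_bond_0 -x 3 -F --report_gbits <server_bond0_ip>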

Have you encountered this before? What could be causing it? Looking forward to your reply.

The nccl-tests results are as follows:

mpirun --allow-run-as-root --host xxxx -x UCX_NET_DEVICES=mlx5_bond_0:1 -x UCX_IB_GID_INDEX=3 -x LD_LIBRARY_PATH=/root/nccl-bond/build/lib:$LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME==bond0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=136 -x NCCL_IB_HCA==mlx5_bond_0 -x NCCL_P2P_DISABLE=1 -x NCCL_SHM_DISABLE=1 /home/test/nccl-tests/build/alltoall_perf -b 2M -e 4096M -f 2 -g 2 -n 20

Test results with four servers (8 ranks):

# nThread 1 nGpus 1 minBytes 67108864 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 2682161 on server1 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 2682162 on server1 device  2 [0x52] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 3139299 on  server2 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 3139300 on  server2 device  2 [0x52] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid  50064 on  server3 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid  50065 on  server3 device  2 [0x52] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid 2672680 on server4 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid 2672681 on server4 device  2 [0x52] NVIDIA A100-SXM4-80GB
NCCL version 2.18.3+cuda12.2
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    67108864       2097152     float    none      -1   2665.2   25.18   22.03      0   2662.0   25.21   22.06    N/A
   134217728       4194304     float    none      -1   5224.1   25.69   22.48      0   5264.2   25.50   22.31    N/A
   268435456       8388608     float    none      -1    10289   26.09   22.83      0    10334   25.97   22.73    N/A
   536870912      16777216     float    none      -1    20513   26.17   22.90      0    20585   26.08   22.82    N/A
  1073741824      33554432     float    none      -1    40882   26.26   22.98      0    41022   26.17   22.90    N/A
  2147483648      67108864     float    none      -1    81711   26.28   23.00      0    81959   26.20   22.93    N/A
  4294967296     134217728     float    none      -1   163115   26.33   23.04      0   163963   26.19   22.92    N/A

Test results with three servers (6 ranks):

# nThread 1 nGpus 1 minBytes 67108864 maxBytes 4294967296 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 2617867 on server1 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 2617868 on server1 device  2 [0x52] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 3103671 on  server2 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 3103672 on  server2 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid 2637126 on server3 device  0 [0x23] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid 2637127 on server3 device  2 [0x52] NVIDIA A100-SXM4-80GB
NCCL version 2.18.3+cuda12.2
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    67108848       2796202     float    none      -1   6499.2   10.33    8.60      0   6136.2   10.94    9.11    N/A
   134217720       5592405     float    none      -1    14519    9.24    7.70      0    13511    9.93    8.28    N/A
   268435440      11184810     float    none      -1    26193   10.25    8.54      0    23691   11.33    9.44    N/A
   536870904      22369621     float    none      -1    58246    9.22    7.68      0    54668    9.82    8.18    N/A
  1073741808      44739242     float    none      -1   105248   10.20    8.50      0    93663   11.46    9.55    N/A
  2147483640      89478485     float    none      -1   233191    9.21    7.67      0   221382    9.70    8.08    N/A
  4294967280     178956970     float    none      -1   420496   10.21    8.51      0   395454   10.86    9.05    N/A
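For reference when reading these tables: for alltoall, busbw is algbw scaled by (nranks-1)/nranks. A quick check of that relation against the first row of each table:

$ echo "scale=2; 25.18 * 7 / 8" | bc
22.03
$ echo "scale=2; 10.33 * 5 / 6" | bc
8.60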
sjeaugey commented 3 months ago

It seems that on server2, rank 3 is not using device 2 but device 0 instead. I'm not actually sure how that's possible given rank 2 is already using that device, but maybe there is an error in the launch script, so you end up with two ranks using the same GPU and NIC?

fj1425fj commented 3 months ago

Sorry, I made a mistake while editing the output. Rank 3 is using device 2.

sjeaugey commented 3 months ago

The bad performance might just be an alignment issue. If you look at the element counts, every other size is aligned to 2 elements and the rest only to 1. Since these are floats, the per-rank buffers are aligned to 8 or 4 bytes, but never to 16 bytes, which is what gives good performance. That's because we divide the total size by the number of ranks, so when you run with a rank count that is not a power of two, you should use a start size that is a multiple of the number of ranks, e.g. -b 3M instead of -b 2M.
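For example, in the 6-rank run above the first size is 67108848 B / 6 ranks = 11184808 B per rank, which is 8-byte but not 16-byte aligned, whereas 3M / 6 ranks = 524288 B per rank and stays 16-byte aligned at every doubling with -f 2. A sketch of the adjusted launch, reusing the original command with only -b changed:

mpirun --allow-run-as-root --host xxxx -x UCX_NET_DEVICES=mlx5_bond_0:1 -x UCX_IB_GID_INDEX=3 -x LD_LIBRARY_PATH=/root/nccl-bond/build/lib:$LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME==bond0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_IB_TC=136 -x NCCL_IB_HCA==mlx5_bond_0 -x NCCL_P2P_DISABLE=1 -x NCCL_SHM_DISABLE=1 /home/test/nccl-tests/build/alltoall_perf -b 3M -e 4096M -f 2 -g 2 -n 20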

fj1425fj commented 1 month ago

Thank you for your answer. In my testing, this problem does not occur when the NICs use independent IPs instead of the bond. Do you know why?