NVIDIA / nccl-tests

NCCL Tree allreduce test cannot reach the theoretical bus bandwidth on 2 nodes with 4 nics #232

Closed by ProHuper 3 weeks ago

ProHuper commented 3 weeks ago
$ nvidia-smi topo -m

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     SYS     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     PIX     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     PIX     SYS     SYS     SYS     48-95,144-191   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     48-95,144-191   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     PIX     SYS     SYS     48-95,144-191   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     PIX     SYS     48-95,144-191   1               N/A
NIC0    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS
NIC1    SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS
NIC2    SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
NIC3    SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS      X      SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS      X      SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_4
  NIC3: mlx5_5
  NIC4: mlx5_6
  NIC5: mlx5_bond_0

Two-node allreduce test, 8 H100 GPUs per node, using 4 NICs: the measured busbw peaks at about 309 GB/s, but the theoretical busbw should be 360 GB/s.

$ mpirun --allow-run-as-root --hostfile hosts.txt  --oversubscribe  -x  NCCL_ALGO=Tree -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_4,mlx5_5 -np 16 ./all_reduce_perf -b 2M -e 16G -f 2 -n 10 -g 1 -w 10

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s) 
     2097152        524288     float     sum      -1    118.1   17.75   33.29      0    92.65   22.64   42.44      0
     4194304       1048576     float     sum      -1    104.8   40.01   75.03      0    105.4   39.78   74.59      0
     8388608       2097152     float     sum      -1    140.7   59.60  111.75      0    142.9   58.72  110.10      0
    16777216       4194304     float     sum      -1    231.9   72.33  135.62      0    237.8   70.56  132.29      0
    33554432       8388608     float     sum      -1    412.3   81.39  152.60      0    417.3   80.40  150.75      0
    67108864      16777216     float     sum      -1    663.5  101.14  189.64      0    672.7   99.76  187.05      0
   134217728      33554432     float     sum      -1   1168.2  114.89  215.42      0   1311.3  102.35  191.91      0
   268435456      67108864     float     sum      -1   2130.3  126.01  236.27      0   2130.6  125.99  236.23      0
   536870912     134217728     float     sum      -1   3611.0  148.68  278.77      0   3603.2  149.00  279.37      0
  1073741824     268435456     float     sum      -1   6793.3  158.06  296.36      0   6781.1  158.34  296.89      0
  2147483648     536870912     float     sum      -1    13184  162.89  305.41      0    13129  163.56  306.68      0
  4294967296    1073741824     float     sum      -1    25986  165.28  309.90      0    25893  165.87  311.01      0
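
For reference, nccl-tests derives busbw for allreduce from algbw by multiplying by 2(n-1)/n, where n is the number of ranks. The sketch below recomputes the last row of the table above from its size and time columns. The function name `allreduce_busbw` is made up for illustration and is not part of nccl-tests.

```python
def allreduce_busbw(size_bytes, time_us, nranks):
    """Return (algbw, busbw) in GB/s for an allreduce measurement.

    algbw = bytes / time; busbw = algbw * 2 * (n - 1) / n,
    which is the correction factor nccl-tests documents for allreduce.
    """
    algbw = size_bytes / (time_us * 1e-6) / 1e9  # bytes per second -> GB/s
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw


# Last out-of-place row of the 16-rank run: 4294967296 B in 25986 us.
algbw, busbw = allreduce_busbw(4294967296, 25986, 16)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

With 16 ranks the factor is 1.875, so the reported 309.90 GB/s busbw corresponds to about 165 GB/s of algbw. With only 2 ranks the factor is 1, so algbw and busbw coincide.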

Two-node allreduce test, 1 H100 GPU per node, using 4 NICs: the measured busbw is about 50 GB/s, but the theoretical busbw should be 200 GB/s.

$ mpirun --allow-run-as-root --hostfile hosts.txt  --oversubscribe  -x  NCCL_ALGO=Tree -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_4,mlx5_5 -np 2 ./all_reduce_perf -b 2M -e 16G -f 2 -n 10 -g 1 -w 10

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s) 
     2097152        524288     float     sum      -1    113.2   18.53   18.53      0    93.35   22.46   22.46      0
     4194304       1048576     float     sum      -1    154.4   27.16   27.16      0    153.3   27.37   27.37      0
     8388608       2097152     float     sum      -1    231.4   36.24   36.24      0    227.8   36.83   36.83      0
    16777216       4194304     float     sum      -1    420.5   39.90   39.90      0    419.9   39.95   39.95      0
    33554432       8388608     float     sum      -1    812.3   41.31   41.31      0    808.2   41.52   41.52      0
    67108864      16777216     float     sum      -1   1545.1   43.43   43.43      0   1561.3   42.98   42.98      0
   134217728      33554432     float     sum      -1   2973.1   45.14   45.14      0   2970.4   45.19   45.19      0
   268435456      67108864     float     sum      -1   5715.9   46.96   46.96      0   5676.1   47.29   47.29      0
   536870912     134217728     float     sum      -1    11146   48.17   48.17      0    11156   48.12   48.12      0
  1073741824     268435456     float     sum      -1    22062   48.67   48.67      0    21997   48.81   48.81      0
  2147483648     536870912     float     sum      -1    43733   49.10   49.10      0    43697   49.15   49.15      0
  4294967296    1073741824     float     sum      -1    87278   49.21   49.21      0    87197   49.26   49.26      0
  8589934592    2147483648     float     sum      -1   174121   49.33   49.33      0   174234   49.30   49.30      0
 17179869184    4294967296     float     sum      -1   347919   49.38   49.38      0   347833   49.39   49.39      0
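
To put the two runs side by side, here is a small sketch that compares each measured peak busbw with the theoretical figure quoted above. The even per-NIC split at the end is an assumption (it treats the 200 GB/s figure as four equal NICs) and is not something the output itself confirms.

```python
def efficiency(measured_gbs, theoretical_gbs):
    """Fraction of the theoretical bus bandwidth that was actually reached."""
    return measured_gbs / theoretical_gbs


# (measured peak busbw, theoretical busbw) in GB/s, taken from the two runs.
cases = {
    "16 ranks (8 GPUs per node)": (309.90, 360.0),
    "2 ranks (1 GPU per node)": (49.38, 200.0),
}

for name, (measured, theoretical) in cases.items():
    print(f"{name}: {efficiency(measured, theoretical):.1%} of theoretical")

# Assumption: the 200 GB/s figure is the sum of 4 equal NICs.
per_nic_share = 200.0 / 4
print(f"assumed per-NIC share: {per_nic_share:.0f} GB/s")
```

Under that assumption, the roughly 49 GB/s seen in the 2-rank run is close to the share of a single NIC, whereas the 16-rank run reaches most of its theoretical figure.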

The NCCL INFO log shows that GPUDirect RDMA (GDRDMA) traffic uses only one NIC: every channel goes through NET/IBext/0.
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 02/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 03/0 : 1[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48051 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:48036:48049 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu20:38570:38582 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu20:38570:38582 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:48036:48049 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:48036:48051 [0] NCCL INFO Connected all rings
qh100-gpu20:38570:38584 [0] NCCL INFO Connected all rings
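
A quick way to confirm this from a longer log is to count the network device index on each "via NET/..." line. The sketch below assumes the number after the plugin name (the `0` in `NET/IBext/0/GDRDMA`) is the NCCL network device index. The helper name `nic_usage` and the sample text are illustrative only.

```python
import re
from collections import Counter

# Matches e.g. "via NET/IBext/0/GDRDMA" or "via NET/IB/1".
NET_RE = re.compile(r"via NET/(\w+)/(\d+)(/GDRDMA)?")


def nic_usage(log_text):
    """Count channel connections per network device index in an NCCL INFO log."""
    counts = Counter()
    for match in NET_RE.finditer(log_text):
        counts[int(match.group(2))] += 1
    return counts


# Four lines copied from the log above.
sample = """\
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:38570:38584 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
"""

counts = nic_usage(sample)
print(dict(counts))
```

If all four NICs were in use, the counter would be expected to contain several distinct device indices rather than only index 0.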