NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

2-node 16 H20 GPU allreduce performance is not as expected with NVLink SHARP #1244

Open vickkylu opened 3 months ago

vickkylu commented 3 months ago

The 2-node, 16 H20 GPU allreduce bus bandwidth is 343 GB/s (with NVLink SHARP), but theoretically it should be able to reach ~460 GB/s:

#                                                            out-of-place                          in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum      -1    121.8    8.61   16.14      0    126.6    8.28   15.53      0
     2097152        524288     float     sum      -1    129.8   16.16   30.30      0    127.0   16.51   30.96      0
     4194304       1048576     float     sum      -1    127.6   32.87   61.63      0    124.1   33.79   63.35      0
     8388608       2097152     float     sum      -1    146.5   57.27  107.38      0    145.2   57.78  108.34      0
    16777216       4194304     float     sum      -1    203.0   82.64  154.95      0    199.1   84.26  157.98      0
    33554432       8388608     float     sum      -1    298.5  112.40  210.76      0    294.5  113.92  213.60      0
    67108864      16777216     float     sum      -1    467.9  143.41  268.90      0    466.9  143.72  269.47      0
   134217728      33554432     float     sum      -1   1108.0  121.14  227.14      0   1111.2  120.79  226.48      0
   268435456      67108864     float     sum      -1   2055.4  130.60  244.88      0   2057.6  130.46  244.62      0
   536870912     134217728     float     sum      -1   3218.6  166.80  312.75      0   3204.9  167.52  314.10      0
  1073741824     268435456     float     sum      -1   6070.5  176.88  331.65      0   6066.8  176.99  331.85      0
  2147483648     536870912     float     sum      -1    11718  183.26  343.61      0    11735  182.99  343.11      0

Single-node 8 H20 GPU allreduce reaches 468 GB/s:

#                                                            out-of-place                          in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144     float     sum      -1    85.52   12.26   21.46      0    79.19   13.24   23.17      0
     2097152        524288     float     sum      -1    80.52   26.05   45.58      0    84.64   24.78   43.36      0
     4194304       1048576     float     sum      -1    86.01   48.77   85.34      0    93.39   44.91   78.59      0
     8388608       2097152     float     sum      -1    90.54   92.65  162.14      0    94.13   89.12  155.96      0
    16777216       4194304     float     sum      -1    128.1  130.93  229.12      0    130.2  128.89  225.56      0
    33554432       8388608     float     sum      -1    199.8  167.94  293.90      0    201.0  166.95  292.17      0
    67108864      16777216     float     sum      -1    328.0  204.60  358.05      0    327.7  204.77  358.34      0
   134217728      33554432     float     sum      -1    582.1  230.58  403.52      0    581.3  230.90  404.08      0
   268435456      67108864     float     sum      -1   1091.3  245.98  430.47      0   1090.5  246.15  430.76      0
   536870912     134217728     float     sum      -1   2100.7  255.57  447.25      0   2097.6  255.94  447.90      0
  1073741824     268435456     float     sum      -1   4067.9  263.95  461.92      0   4068.3  263.93  461.88      0
  2147483648     536870912     float     sum      -1   8025.6  267.58  468.26      0   8028.1  267.50  468.12      0

Why does performance drop so much from a single node to multiple nodes? The NCCL version is 2.20.3.

AddyLaddy commented 3 months ago

What is your inter-node connection? 8x NDR InfiniBand? Do you have IB SHARP? In most cases the inter-node IB/RoCE BW will be the bottleneck in multi-node runs. I'd also suggest setting NCCL_ALGO=RING when testing on 2 nodes to get a more representative number.
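A sketch of the suggested ring-algorithm run, assuming an MPI launch of the nccl-tests binary (hostnames, slot counts, and the binary path are placeholders):

```sh
# Force the ring algorithm so the 2-node number reflects raw inter-node
# bandwidth rather than NVLS/tree acceleration (launch details hypothetical):
mpirun -np 16 -H node1:8,node2:8 \
  -x NCCL_ALGO=RING \
  ./build/all_reduce_perf -b 1M -e 2G -f 2 -g 1
```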

haswelliris commented 3 months ago

Tested on 2 nodes, 16 H20 GPUs, 16 CX7 NICs (IB NDR), with NCCL 2.21.5 + CUDA 12.3: it works well using NCCL_ALGO=NVLSTree.
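A launch along these lines, with NVLSTree forced, could produce the sweep below (hostnames and binary path are again placeholders):

```sh
# Force the NVLS tree algorithm; the size sweep (4 B to 16 GiB, doubling)
# mirrors the table that follows (launch details hypothetical):
mpirun -np 16 -H node1:8,node2:8 \
  -x NCCL_ALGO=NVLSTree \
  ./build/all_reduce_perf -b 4 -e 16G -f 2 -g 1
```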

#                                                     out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           4             1     float     sum    47.54    0.00    0.00  0e+00    49.35    0.00    0.00  4e-07
           8             2     float     sum    47.60    0.00    0.00  2e-07    47.54    0.00    0.00  2e-07
          16             4     float     sum    46.25    0.00    0.00  1e-07    46.27    0.00    0.00  1e-07
          32             8     float     sum    46.35    0.00    0.00  2e-07    46.28    0.00    0.00  2e-07
          64            16     float     sum    46.30    0.00    0.00  1e-07    46.29    0.00    0.00  2e-07
         128            32     float     sum    47.76    0.00    0.01  6e-08    46.52    0.00    0.01  6e-08
         256            64     float     sum    46.65    0.01    0.01  1e-07    46.73    0.01    0.01  6e-08
         512           128     float     sum    46.58    0.01    0.02  6e-08    46.42    0.01    0.02  6e-08
        1024           256     float     sum    46.61    0.02    0.04  2e-07    46.55    0.02    0.04  4e-07
        2048           512     float     sum    47.24    0.04    0.08  4e-07    47.26    0.04    0.08  5e-07
        4096          1024     float     sum    46.19    0.09    0.17  4e-07    46.13    0.09    0.17  4e-07
        8192          2048     float     sum    47.19    0.17    0.33  5e-07    47.21    0.17    0.33  5e-07
       16384          4096     float     sum    48.59    0.34    0.63  5e-07    48.52    0.34    0.63  5e-07
       32768          8192     float     sum    51.94    0.63    1.18  5e-07    51.86    0.63    1.18  5e-07
       65536         16384     float     sum    59.59    1.10    2.06  5e-07    59.15    1.11    2.08  5e-07
      131072         32768     float     sum    60.33    2.17    4.07  5e-07    60.23    2.18    4.08  5e-07
      262144         65536     float     sum    62.81    4.17    7.83  5e-07    62.74    4.18    7.83  5e-07
      524288        131072     float     sum    67.93    7.72   14.47  5e-07    67.51    7.77   14.56  5e-07
     1048576        262144     float     sum    69.28   15.14   28.38  5e-07    69.25   15.14   28.39  5e-07
     2097152        524288     float     sum    84.55   24.80   46.50  5e-07    85.88   24.42   45.79  5e-07
     4194304       1048576     float     sum    112.4   37.33   69.99  5e-07    105.6   39.71   74.45  5e-07
     8388608       2097152     float     sum    147.8   56.77  106.45  5e-07    147.7   56.81  106.51  5e-07
    16777216       4194304     float     sum    198.7   84.43  158.30  5e-07    194.7   86.16  161.54  5e-07
    33554432       8388608     float     sum    265.8  126.24  236.70  5e-07    265.8  126.22  236.66  5e-07
    67108864      16777216     float     sum    461.6  145.39  272.61  5e-07    461.7  145.35  272.53  5e-07
   134217728      33554432     float     sum    737.2  182.05  341.35  5e-07    737.2  182.06  341.36  5e-07
   268435456      67108864     float     sum   1259.7  213.10  399.56  5e-07   1258.7  213.26  399.87  5e-07
   536870912     134217728     float     sum   2307.9  232.62  436.16  5e-07   2305.5  232.86  436.62  5e-07
  1073741824     268435456     float     sum   4396.3  244.24  457.94  5e-07   4398.0  244.15  457.77  5e-07
  2147483648     536870912     float     sum   8564.5  250.74  470.14  5e-07   8588.7  250.04  468.82  5e-07
  4294967296    1073741824     float     sum    16931  253.67  475.63  5e-07    16933  253.65  475.59  5e-07
  8589934592    2147483648     float     sum    33726  254.70  477.56  5e-07    33714  254.79  477.72  5e-07
 17179869184    4294967296     float     sum    67332  255.15  478.41  5e-07    67370  255.01  478.14  5e-07

sjeaugey commented 3 months ago

Indeed, on two nodes we should be able to get 460-480 GB/s using the NVLSTree algorithm. Depending on your networking, something could be making NCCL use rings instead. To dig into that, we'd need to know how many NICs you have and the NIC port speed.
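One way to gather that information, assuming standard InfiniBand tooling and a local nccl-tests build (the binary path is a placeholder):

```sh
# Report each HCA's port state and active link width/speed:
ibv_devinfo -v | grep -E 'hca_id|state|active_width|active_speed'

# Or let NCCL log the NICs, topology, and algorithms it actually selects:
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH \
  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```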

echobinarybytes commented 2 days ago

Say every node has 8 IB NICs, and each NIC's bandwidth is 400 Gb/s, so one node has 400 GB/s in total. Why can the busbw for two nodes exceed 400 GB/s with NVLSTree?

I know that the inter-node part is a tree allreduce, so how can it cross the 400 GB/s bandwidth limit? How should I understand it?
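For context, busbw in nccl-tests is a derived metric, not measured wire traffic: for allreduce it is algbw scaled by 2(n-1)/n, where n is the number of ranks (per nccl-tests' PERFORMANCE.md). Checking that against the 16 GiB row of the NVLSTree table above:

```latex
% nccl-tests allreduce bus bandwidth, n = number of ranks:
%   busbw = algbw * 2(n-1)/n
% With n = 16 and algbw = 255.15 GB/s (the 16 GiB row above):
\mathrm{busbw} = 255.15 \times \frac{2(16-1)}{16} = 255.15 \times 1.875 \approx 478.4\ \mathrm{GB/s}
```

The 2(n-1)/n factor models a flat ring. A 2-node tree moves each node's data across the network roughly once per direction, i.e. about algbw (~255 GB/s) per node, which stays under the 8 x 400 Gb/s = 400 GB/s per-node NIC budget even though the normalized busbw exceeds it.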