NVIDIA / nccl-tests

NCCL Tests

Expected bandwidth results? 8x A100 GPUs over NVLink #149

Open acgandhi opened 1 year ago

acgandhi commented 1 year ago

Hi! I wanted to verify that the nccl-tests results I am getting match up with what should be expected. Our configuration is an HPE Apollo 6500 machine with 8x A100 80GB GPUs connected together with NVLink. I believe each NVLink is capable of 600 GB/s transfers (for an aggregate bandwidth of 4.8 TB/s), however the nccl tests show a busbw of ~200 GB/s. Is this expected? I also ran the p2p bandwidth and latency test [results], which showed bidirectional GPU-GPU transfers in the range of 430-520 GB/s. Tests were built and run in the NVIDIA HPC Docker container v23.5 with libnccl version 2.18.1 in the container and 2.17.1 on the host. For all nccl tests see here; all_reduce is shown below.

./build/all_reduce_perf -b 8 -e 1G -f 4 -g 1 -t 8
# nThread 8 nGpus 1 minBytes 8 maxBytes 1073741824 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid     78 on 930ae2939a89 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid     78 on 930ae2939a89 device  1 [0x0b] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid     78 on 930ae2939a89 device  2 [0x48] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid     78 on 930ae2939a89 device  3 [0x4c] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid     78 on 930ae2939a89 device  4 [0x88] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid     78 on 930ae2939a89 device  5 [0x8b] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid     78 on 930ae2939a89 device  6 [0xc9] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid     78 on 930ae2939a89 device  7 [0xcc] NVIDIA A100-SXM4-80GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    20.65    0.00    0.00      0    20.35    0.00    0.00      0
          32             8     float     sum      -1    20.77    0.00    0.00      0    20.53    0.00    0.00      0
         128            32     float     sum      -1    22.40    0.01    0.01      0    21.52    0.01    0.01      0
         512           128     float     sum      -1    22.57    0.02    0.04      0    22.13    0.02    0.04      0
        2048           512     float     sum      -1    23.55    0.09    0.15      0    23.25    0.09    0.15      0
        8192          2048     float     sum      -1    29.78    0.28    0.48      0    27.21    0.30    0.53      0
       32768          8192     float     sum      -1    35.47    0.92    1.62      0    31.55    1.04    1.82      0
      131072         32768     float     sum      -1    35.58    3.68    6.45      0    33.16    3.95    6.92      0
      524288        131072     float     sum      -1    36.98   14.18   24.81      0    34.00   15.42   26.99      0
     2097152        524288     float     sum      -1    76.80   27.31   47.78      0    73.82   28.41   49.72      0
     8388608       2097152     float     sum      -1    154.5   54.31   95.04      0    134.3   62.46  109.31      0
    33554432       8388608     float     sum      -1    352.3   95.24  166.67      0    355.7   94.33  165.09      0
   134217728      33554432     float     sum      -1   1149.0  116.81  204.42      0   1148.9  116.82  204.44      0
   536870912     134217728     float     sum      -1   4165.7  128.88  225.54      0   4159.4  129.07  225.88      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 55.8535 
#
$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     48-63,176-191   3
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     48-63,176-191   3
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     16-31,144-159   1
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     16-31,144-159   1
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     112-127,240-255 7
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     112-127,240-255 7
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     80-95,208-223   5
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     80-95,208-223   5
NIC0    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
NIC1    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

Thank you!

sjeaugey commented 1 year ago

I believe each NVLink is capable of 600 GB/s transfers

That's the line rate of 12 NVLinks, adding both directions. Each NVLink has a line rate of 25 GB/s (in each direction), which converts to ~20 GB/s effective bandwidth with a 128B payload size (*). The total effective bandwidth should therefore be 240 GB/s on A100, but some other bottlenecks on the GPUs limit it to ~230 GB/s.
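
(Spelling out the arithmetic behind these figures: 12 NVLinks per GPU × 25 GB/s line rate = 300 GB/s per direction raw, and 12 × ~20 GB/s effective ≈ 240 GB/s per direction usable.)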

You should run the test again with a larger size. At 512M the bandwidth is still increasing, so you should replace 1G with 8G. That should reach 230 GB/s.

(*) SMs are limited to a 128B payload size, whereas Copy Engines can use 256B. The p2pBandwidthLatency test uses the CEs by default, so you'll see higher performance than what you can reach from the SMs. To test the SM bandwidth, you can run p2pBandwidthLatency with the --sm_copy option.
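
(For reference, the rerun with the larger maximum size suggested above would be the original command with 1G replaced by 8G; the second line below assumes the CUDA sample binary is named p2pBandwidthLatencyTest, which is not stated in this thread.)

./build/all_reduce_perf -b 8 -e 8G -f 4 -g 1 -t 8
./p2pBandwidthLatencyTest --sm_copy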

zobinHuang commented 1 year ago

Hi sjeaugey,

According to Wikipedia - NVLink Performance, NVLink 3.0 reaches ~6.25 GB/s (50 Gb/s) per differential pair, which means that in theory a single sublink formed by 8 differential pairs would reach 6.25 × 8 = 50 GB/s.

Each NVLink has a line rate of 25GB/s (in each direction) which converts to ~20GB/s effective bandwidth with a 128B payload size (*)

But in fact the line rate of a sublink is only 25 GB/s; could you please tell me why there's such a gap? Thanks!

sjeaugey commented 1 year ago

IIRC on NVLink3 we have only 4 pairs, not 8.
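
(With 4 pairs the numbers line up: 4 × ~6.25 GB/s per differential pair ≈ 25 GB/s per NVLink per direction, matching the line rate quoted earlier.)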

zobinHuang commented 1 year ago

IIRC on NVLink3 we have only 4 pairs, not 8.

okay thanks

zhang662817 commented 9 months ago

That's the line rate of 12 NVLinks, adding both directions. Each NVLink has a line rate of 25 GB/s (in each direction), which converts to ~20 GB/s effective bandwidth with a 128B payload size (*). The total effective bandwidth should therefore be 240 GB/s on A100, but some other bottlenecks on the GPUs limit it to ~230 GB/s.

Hi @sjeaugey, unidirectional is ~20 GB/s, but bidirectional should be ~40 GB/s, so "the total effective bandwidth should therefore be 480 GB/s = 12 × 20 × 2 on A100", right?

Why 240GB/s on A100?

sjeaugey commented 9 months ago

We don't multiply the bandwidths by two to account for each direction. No technology these days is half-duplex, so it's more natural to report a bandwidth that corresponds to the NIC speed or PCI bandwidth. And yes, NVLink bandwidth is usually advertised adding both directions; that's what it is, but we can't reconcile the two approaches. It's simply a convention.
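
(Put numerically: the NCCL convention reports per-direction bandwidth, 12 × ~20 GB/s ≈ 240 GB/s, while the advertised NVLink figure sums both directions, 2 × 300 GB/s = 600 GB/s at line rate, or 2 × 240 GB/s = 480 GB/s effective. Both describe the same hardware.)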

zhang662817 commented 9 months ago

[screenshot of CUDA p2pBandwidthLatencyTest output]

@sjeaugey In the CUDA sample p2p test, why can the NVLink BW reach 528 GB/s, higher than the "theoretical 300 GB/s"?

AddyLaddy commented 9 months ago

Same answer as above. They report bidirectional BW as sum of send and receive, so 2x what NCCL reports.
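
(As a rough check: 528 GB/s summed over both directions is ~264 GB/s per direction, below the 300 GB/s line rate and above the ~230 GB/s SM-based figure, consistent with the earlier footnote about Copy Engines using 256B payloads.)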

HeGaoYuan commented 8 months ago

We don't multiply the bandwidths by two to account for each direction. No technology these days is half-duplex, so it's more natural to report a bandwidth that corresponds to the NIC speed or PCI bandwidth. And yes, NVLink bandwidth is usually advertised adding both directions; that's what it is, but we can't reconcile the two approaches. It's simply a convention.

I have two servers, each with a bonded RDMA HCA.

I used the ib_write_bw command to test the bandwidth; the result is 185 Gb/s. (ib_write_bw -d mlx5_bond_1 -x 3)

I used nccl-tests (2.10.3) to test the bandwidth; the result is 88 Gb/s. (mpirun -H host1:1,host2:1 -n 2 -N 1 ./build/all_reduce_perf -b 32M -e 1G -f 2 -g 1)

Is the result expected?

@sjeaugey @AddyLaddy Looking forward to your reply. If you need any further details, please tell me.

ssyrc commented 3 months ago

Hi @sjeaugey ,

That's the line rate of 12 NVLinks, adding both directions. Each NVLink has a line rate of 25 GB/s (in each direction), which converts to ~20 GB/s effective bandwidth with a 128B payload size (*). The total effective bandwidth should therefore be 240 GB/s on A100, but some other bottlenecks on the GPUs limit it to ~230 GB/s. You should run the test again with a larger size. At 512M the bandwidth is still increasing, so you should replace 1G with 8G. That should reach 230 GB/s.

I'm afraid I don't fully understand collective communication. In an 8-GPU server, there are 6 NVSwitches on each GPU board, and each GPU is connected to each NVSwitch by 2 NVLinks. So there are 12 NVLinks per GPU. However, I don't understand why the total bandwidth for all-reduce communication between 8 GPUs is 20 GB/s × 12 links.

I understand that we need to multiply by about 20 GB/s due to the bottleneck, but because it is all-reduce communication, shouldn't we multiply by a different number like 16 links (8 GPUs × 2) instead of 12 links?