acgandhi opened this issue 1 year ago
I believe each NVLink is capable of 600 GB/s transfers
That's the line rate of 12 NVLinks, adding both directions. Each NVLink has a line rate of 25 GB/s (in each direction), which converts to ~20 GB/s effective bandwidth with a 128B payload size (*). The total effective bandwidth should therefore be 240 GB/s on A100, but other bottlenecks on the GPU limit it to ~230 GB/s.
You should run the test again with a larger size. At 512M the bandwidth is still increasing, so you should replace 1G with 8G. That should reach 230 GB/s.
(*) SMs are limited to a 128B payload size, while Copy Engines can use 256B. The p2pBandwidthLatency test uses the CE by default, so you'll see higher performance than what you can reach from the SMs. To test the SM bandwidth, you can run p2pBandwidthLatency with the --sm_copy option.
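For reference, a quick back-of-the-envelope check of those numbers as a Python sketch. The efficiency factor here is inferred from the 25 GB/s -> ~20 GB/s figure quoted above, not an official spec:

```python
# Back-of-the-envelope A100 NVLink bandwidth check.
# Assumption: the ~0.8 efficiency at 128B payloads is inferred from the
# 25 GB/s line rate -> ~20 GB/s effective figure quoted in this thread.
LINE_RATE = 25            # GB/s per NVLink, per direction
NUM_LINKS = 12            # NVLinks per A100 GPU
EFFICIENCY_128B = 20 / 25 # effective rate / line rate at 128B payloads

effective_per_link = LINE_RATE * EFFICIENCY_128B   # ~20 GB/s
total_effective = effective_per_link * NUM_LINKS   # ~240 GB/s

print(f"per-link effective: {effective_per_link:.0f} GB/s")
print(f"total effective:    {total_effective:.0f} GB/s (observed: ~230 GB/s)")
```

The size sweep mentioned above maps to the -b/-e flags of the nccl-tests binaries, e.g. passing -e 8G instead of -e 1G to all_reduce_perf.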
Hi sjeaugey,
According to Wikipedia - NVLink Performance, NVLink 3.0 reaches ~6.25 GB/s (50 Gbit/s) per differential pair, which means that in theory a single sublink formed by 8 differential pairs would reach 6.25 x 8 = 50 GB/s.
Each NVLink has a line rate of 25 GB/s (in each direction), which converts to ~20 GB/s effective bandwidth with a 128B payload size (*)
But in fact the line rate of a sublink is only 25 GB/s. Could you please tell me why there's such a gap? Thanks!
IIRC on NVLink3 we have only 4 pairs, not 8.
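As a quick consistency check against the per-pair rate from the Wikipedia table quoted above:

```python
# NVLink3 sublink line rate from the per-pair signaling rate.
PAIR_RATE = 6.25         # GB/s (~50 Gbit/s) per differential pair
PAIRS_PER_DIRECTION = 4  # NVLink3 uses 4 pairs per direction, not 8

print(PAIR_RATE * PAIRS_PER_DIRECTION)  # 25.0 GB/s, matching the line rate above
```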
IIRC on NVLink3 we have only 4 pairs, not 8.
okay thanks
That's the line rate of 12 NVLinks, adding both directions. Each NVLink has a line rate of 25 GB/s (in each direction), which converts to ~20 GB/s effective bandwidth with a 128B payload size (*). The total effective bandwidth should therefore be 240 GB/s on A100, but other bottlenecks on the GPU limit it to ~230 GB/s.
Hi @sjeaugey, unidirectional is ~20 GB/s, but bidirectional should be ~40 GB/s, so "the total effective bandwidth should therefore be 480 GB/s = 12 * 20 * 2" on A100, right?
Why 240 GB/s on A100?
We don't multiply the bandwidths by two to account for each direction. No technology these days is half-duplex, so it's more natural to report the one-direction BW, which corresponds to the NIC speed or PCI bandwidth. And yes, NVLink BWs are usually advertised adding both directions; that's what it is, but we can't reconcile the two approaches. It's simply a convention.
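To make the two conventions concrete, here is a small sketch using the ~20 GB/s effective per-link figure from earlier in the thread:

```python
# Same hardware, two reporting conventions.
EFFECTIVE_PER_LINK = 20   # GB/s, one direction, 128B payloads (from above)
NUM_LINKS = 12            # NVLinks per A100 GPU

one_direction = EFFECTIVE_PER_LINK * NUM_LINKS  # 240 GB/s: what NCCL reports
both_directions = one_direction * 2             # 480 GB/s: the advertised style

print(f"one direction (NCCL convention): {one_direction} GB/s")
print(f"both directions (advertised):    {both_directions} GB/s")
```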
@sjeaugey in the CUDA sample p2p test, why can the NVLink BW reach 528 GB/s, higher than the "theoretical 300 GB/s"?
Same answer as above. They report the bidirectional BW as the sum of send and receive, so 2x what NCCL reports.
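Applying that convention to the numbers above (a sketch; 300 GB/s here is 12 links * 25 GB/s line rate in one direction):

```python
# Reconciling the p2pBandwidthLatency figure with the one-direction line rate.
reported_bidirectional = 528                 # GB/s, send + receive summed
one_direction = reported_bidirectional / 2   # 264 GB/s
line_rate_one_direction = 12 * 25            # 300 GB/s

# 264 < 300: still below line rate. It sits above the ~230 GB/s SM figure
# because the test defaults to Copy Engines with 256B payloads (see above).
print(one_direction, line_rate_one_direction)
```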
We don't multiply the bandwidths by two to account for each direction. No technology these days is half-duplex, so it's more natural to report the one-direction BW, which corresponds to the NIC speed or PCI bandwidth. And yes, NVLink BWs are usually advertised adding both directions; that's what it is, but we can't reconcile the two approaches. It's simply a convention.
I have two servers, each with a bonded RDMA HCA.
I used the ib_write_bw command to test the bandwidth; the result is 185 Gb/s. (ib_write_bw -d mlx5_bond_1 -x 3)
I used nccl_test (2.10.3) to test the bandwidth; the result is 88 Gb/s. (mpirun -H host1:1,host2:1 -n 2 -N 1 ./build/all_reduce_perf -b 32M -e 1G -f 2 -g 1)
Is this result expected?
@sjeaugey @AddyLaddy Looking forward to your reply. If you need any further details, please tell me.
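For comparing those two numbers, it may help to look at how nccl-tests derives busbw for allreduce (documented in the nccl-tests PERFORMANCE.md): busbw = algbw * 2(n-1)/n, which equals algbw when n = 2.

```python
# busbw correction factor for allreduce, per the nccl-tests documentation.
def allreduce_busbw(algbw: float, n_ranks: int) -> float:
    """busbw = algbw * 2 * (n - 1) / n for allreduce."""
    return algbw * 2 * (n_ranks - 1) / n_ranks

# With 2 ranks the factor is 1, so busbw == algbw:
print(allreduce_busbw(11.0, 2))  # 11.0 GB/s, i.e. the 88 Gb/s reported above
```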
Hi @sjeaugey,
That's the line rate of 12 NVLinks, adding both directions. Each NVLink has a line rate of 25 GB/s (in each direction), which converts to ~20 GB/s effective bandwidth with a 128B payload size (*). The total effective bandwidth should therefore be 240 GB/s on A100, but other bottlenecks on the GPU limit it to ~230 GB/s. You should run the test again with a larger size. At 512M the bandwidth is still increasing, so you should replace 1G with 8G. That should reach 230 GB/s.
I'm afraid I don't fully understand the collective communication here. In an 8-GPU server there are 6 NVSwitches on each GPU board, and each GPU is connected to every NVSwitch by 2 NVLinks, so there are 12 NVLinks per GPU. However, I don't understand why the total bandwidth for all-reduce communication between 8 GPUs is 20 GB/s * 12 links.
I understand that we need to multiply by about 20 GB/s due to the bottleneck, but because it is all-reduce communication, shouldn't we multiply by a different number, like 16 links (8 GPUs * 2), instead of 12 links?
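My understanding, as a sketch rather than an authoritative answer: busbw is normalized per rank, so the relevant ceiling is a single GPU's link bandwidth, independent of how many GPUs take part in the all-reduce.

```python
# Sketch: why the busbw ceiling is 12 links per GPU, not 16 (8 GPUs * 2).
# Assumption: busbw from nccl-tests is a per-rank figure, normalized to be
# directly comparable to one GPU's injection bandwidth into the fabric.
LINKS_PER_GPU = 6 * 2     # 6 NVSwitches * 2 NVLinks each
EFFECTIVE_PER_LINK = 20   # GB/s, from earlier in the thread

# Each GPU can inject at most this much data per second into the fabric,
# no matter how many peers the all-reduce spans:
per_gpu_ceiling = LINKS_PER_GPU * EFFECTIVE_PER_LINK  # 240 GB/s
print(f"per-GPU busbw ceiling: {per_gpu_ceiling} GB/s (observed ~230 GB/s)")
```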
Hi! I wanted to verify that the nccl-test results I am getting match up with what should be expected. Our configuration is an HPE Apollo 6500 machine with 8x A100 80GB GPUs connected together with NVLink. I believe each NVLink is capable of 600 GB/s transfers (for an aggregate bandwidth of 4.8 TB/s), however the nccl tests show a busbw of ~200 GB/s. Is this expected? I also ran the p2p bandwidth and latency test [results], which showed bidirectional GPU-GPU transfers in the range of 430-520 GB/s. Tests were built and run in the Nvidia HPC Docker container v23.5, with libnccl version 2.18.1 in the container and 2.17.1 on the host. For all nccl tests see here; all_reduce shown below.
Thank you!