NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Are the bandwidth results of the NCCL test related to the number of nodes? #169

Open PrometheusComing opened 11 months ago

PrometheusComing commented 11 months ago

Hello, I have 6 A100 (80G) machines (each with 8 A100 cards) on a RoCE network. Running the NCCL test on any pair of machines gives around 100 GB/s of bandwidth. With three servers, the bandwidth drops to 70 GB/s. With four or five servers, it is about 58 GB/s. With six servers, it drops to 40 GB/s. Are the bandwidth results of the NCCL test related to the number of machines participating, or to the total number of cards involved?

The command I used was all_reduce_perf -f 2 -b 1G -e 8G, launched with mpirun. I'm sorry I can't post the full test reports.

AddyLaddy commented 11 months ago

Running the AllReduce nccl-test on two nodes can give perf figures that exceed the capabilities of the external networking card unless you set NCCL_ALGO=RING. This is not the case for jobs run on 3 or more servers. How many RoCE NICs do you have installed per node and what speed are they? Are they attached to the CPU or a PCI-E switch?
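To make the two-node figure comparable with the larger runs, the test can be repeated with the ring algorithm forced. The following is only a sketch: it assumes Open MPI's mpirun (whose -x option exports an environment variable to the ranks and whose -H option accepts host:slots pairs), one rank per GPU, and hypothetical host names node1 and node2. The all_reduce_perf arguments are the ones from the original post.

```python
import shlex


def build_allreduce_command(hosts, gpus_per_node=8, algo="RING"):
    """Build an Open MPI command line that runs all_reduce_perf with NCCL_ALGO set.

    hosts: list of host names (hypothetical here, replace with real ones).
    gpus_per_node: ranks launched per host, assuming one rank per GPU.
    algo: value exported as NCCL_ALGO to every rank.
    """
    total_ranks = len(hosts) * gpus_per_node
    host_spec = ",".join(f"{host}:{gpus_per_node}" for host in hosts)
    return [
        "mpirun",
        "-np", str(total_ranks),
        "-H", host_spec,
        "-x", f"NCCL_ALGO={algo}",  # export the variable to all ranks
        "all_reduce_perf", "-f", "2", "-b", "1G", "-e", "8G",
    ]


# Hypothetical two-node run; print the command instead of executing it.
cmd = build_allreduce_command(["node1", "node2"])
print(shlex.join(cmd))
```

The command is only printed, not executed, so it can be inspected and adapted (binary path, host names, extra NCCL variables) before running it on the cluster.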

PrometheusComing commented 10 months ago

> Running the AllReduce nccl-test on two nodes can give perf figures that exceed the capabilities of the external networking card unless you set NCCL_ALGO=RING. This is not the case for jobs run on 3 or more servers. How many RoCE NICs do you have installed per node and what speed are they? Are they attached to the CPU or a PCI-E switch?

Thanks for the reply. Each node has 8 RoCE NICs, all of which are 100 Gb/s. They are all on PCIe.
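As a rough sanity check on these figures, the aggregate NIC line rate per node can be compared with the bandwidths reported at the top of the thread. The sketch below assumes that each NIC runs at 100 Gb/s and that the reported numbers are in GB/s. It ignores protocol overhead, so the achievable ceiling is somewhat lower, and the thread does not say whether the reported values are algorithm or bus bandwidth.

```python
NICS_PER_NODE = 8
NIC_GBIT_PER_S = 100  # assumption: "100Gb" means 100 Gb/s per NIC


def nic_ceiling_gbytes_per_s(nics, gbit_per_nic):
    """Aggregate line rate of all NICs in one node, converted from Gb/s to GB/s."""
    return nics * gbit_per_nic / 8  # 8 bits per byte


ceiling = nic_ceiling_gbytes_per_s(NICS_PER_NODE, NIC_GBIT_PER_S)

# Bandwidths reported in the original post, keyed by number of nodes (GB/s).
reported = {2: 100, 3: 70, 4: 58, 5: 58, 6: 40}

# Fraction of the per-node NIC line rate that each run reached.
fraction = {nodes: bw / ceiling for nodes, bw in reported.items()}

for nodes, bw in reported.items():
    print(f"{nodes} nodes: {bw} GB/s = {fraction[nodes]:.0%} of the {ceiling:.0f} GB/s NIC line rate")
```

Under these assumptions the two-node result sits right at the 100 GB/s line rate of the NICs, while the six-node result reaches only 40% of it.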