Open PrometheusComing opened 11 months ago
Running the AllReduce nccl-test on two nodes can give perf figures that exceed the capabilities of the external networking card unless you set NCCL_ALGO=RING. This is not the case for jobs run on 3 or more servers.
How many RoCE NICs do you have installed per node and what speed are they?
Are they attached to the CPU or a PCI-E switch?
Thanks for the reply. Each node has 8 RoCE NICs, all of which are 100 Gb/s. And they are all on PCIe.
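For reference, a minimal sketch of the arithmetic behind "the capabilities of the external networking card", assuming all 8 NICs are used in parallel and ignoring protocol overhead (the function name is only illustrative):

```python
def nic_ceiling_gbytes_per_s(nics_per_node: int, gbits_per_nic: float) -> float:
    """Aggregate per-node line rate in GB/s (8 bits per byte, no overhead)."""
    return nics_per_node * gbits_per_nic / 8.0


# 8 RoCE NICs at 100 Gb/s each, as described in this thread
ceiling = nic_ceiling_gbytes_per_s(8, 100.0)
print(f"Per-node NIC ceiling: {ceiling:.0f} GB/s")
```

Under those assumptions the per-node ceiling is 100 GB/s, so any inter-node figure at or above that deserves a second look.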
Hello, I have six machines, each with 8 A100 (80G) GPUs, connected over a RoCE network. Running the NCCL test on any two machines gives around 100 GB/s of bandwidth. With three servers the bandwidth drops to 70 GB/s, with four or five servers it is about 58 GB/s, and with six servers it drops to 40 GB/s. Is the bandwidth reported by the NCCL test related to the number of machines participating, or to the number of GPUs involved?
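As background for the question, nccl-tests reports two figures for AllReduce: algbw (data size divided by time) and busbw, which scales algbw by 2(n-1)/n for n ranks so that results can be compared across rank counts. The post does not say which of the two the numbers above are, so the following is only a sketch of that conversion, assuming one rank per GPU and 8 GPUs per node:

```python
def allreduce_busbw_factor(n_ranks: int) -> float:
    """Factor nccl-tests applies to algbw to obtain busbw for AllReduce."""
    return 2.0 * (n_ranks - 1) / n_ranks


GPUS_PER_NODE = 8  # assumption: one rank per GPU, as in the setup described

for nodes in (2, 3, 4, 5, 6):
    ranks = nodes * GPUS_PER_NODE
    factor = allreduce_busbw_factor(ranks)
    print(f"{nodes} nodes, {ranks} ranks: busbw = algbw * {factor:.4f}")
```

The factor changes only slightly between 16 and 48 ranks, so the choice of metric alone would not account for a drop from 100 GB/s to 40 GB/s.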
The command I used for testing was all_reduce_perf -f 2 -b 1G -e 8G, launched with mpirun. I'm sorry I can't post the full test reports.
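A sketch of how that run could be assembled with the NCCL_ALGO=RING setting mentioned earlier in the thread. It assumes Open MPI's mpirun (the -np, -H and -x options), one process per GPU, and a binary at ./build/all_reduce_perf; the hostnames node1 and node2 are hypothetical placeholders, and the poster's actual mpirun arguments are not known.

```python
import shlex


def build_nccl_test_command(hosts, gpus_per_node=8, algo="RING"):
    """Build an mpirun argument list for all_reduce_perf (Open MPI syntax assumed)."""
    host_spec = ",".join(f"{h}:{gpus_per_node}" for h in hosts)
    return [
        "mpirun",
        "-np", str(len(hosts) * gpus_per_node),  # one process per GPU
        "-H", host_spec,
        "-x", f"NCCL_ALGO={algo}",               # export the variable to all ranks
        "./build/all_reduce_perf",               # assumed binary location
        "-f", "2", "-b", "1G", "-e", "8G",       # flags quoted in the post
        "-g", "1",                               # one GPU per process
    ]


# Hypothetical two-node run
cmd = build_nccl_test_command(["node1", "node2"])
print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the sketch runnable on a machine without MPI; pass the list to subprocess.run to launch it.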