NVIDIA / nccl-tests

NCCL Tests

nccl test only gets ~65% of the link bandwidth #112

Closed: sandyhouse closed this issue 2 years ago

sandyhouse commented 2 years ago

Environment:

2 nodes with A100 GPUs, intra-connected with PCIe Gen 4 and NVLink, inter-connected with four 200 Gbps RoCEv2 NICs.
NCCL version: 2.11.4
CUDA version: 11.6
ib_write_bw result: about 180 Gbps
GPU topology: (screenshot attached)

nccl-tests command:

mpirun -np 16 -H host1:8,host2:8 -bind-to none -map-by slot -x NCCL_TOPO_DUMP_FILE=$PWD/topology.xml -x LD_LIBRARY_PATH -x PATH -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_HCA=mlx5_0 -mca btl_tcp_if_include ens10np0 -mca pml ob1 -mca btl ^openib --allow-run-as-root ./build/all_reduce_perf -b 1G -e 8G -f 2

(screenshot of all_reduce_perf output attached)

As the result shows, the average bus bandwidth is only 9.84 GB/s (about 80 Gbps), which is far below the link bandwidth.

Any suggestions?

sjeaugey commented 2 years ago

Can you attach the topology file you generated with NCCL_TOPO_DUMP_FILE=$PWD/topology.xml?
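
It may also help to rerun with NCCL debug logging enabled to see which NICs and transports NCCL actually selects. A sketch based on your command above (only the two debug variables are new):

mpirun -np 16 -H host1:8,host2:8 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,NET -x NCCL_TOPO_DUMP_FILE=$PWD/topology.xml -x LD_LIBRARY_PATH -x PATH -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_HCA=mlx5_0 -mca btl_tcp_if_include ens10np0 -mca pml ob1 -mca btl ^openib --allow-run-as-root ./build/all_reduce_perf -b 1G -e 8G -f 2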

sandyhouse commented 2 years ago

> Can you attach the topology file you generated with NCCL_TOPO_DUMP_FILE=$PWD/topology.xml?

Sorry for the late reply. The topology file is attached. topology.txt

Note: we have four 200 Gbps NICs, but in the above experiment only one NIC was used. The performance increases roughly linearly with the number of NICs used.
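
For the multi-NIC runs, NCCL_IB_HCA is given a comma-separated device list. A sketch of the four-NIC command (the HCA names other than mlx5_0 are assumptions; the actual names can be listed with ibv_devinfo):

# List the HCAs present on the node
ibv_devinfo | grep hca_id
# Offer all four HCAs to NCCL in one run
mpirun -np 16 -H host1:8,host2:8 -bind-to none -map-by slot -x NCCL_TOPO_DUMP_FILE=$PWD/topology.xml -x LD_LIBRARY_PATH -x PATH -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 -mca btl_tcp_if_include ens10np0 -mca pml ob1 -mca btl ^openib --allow-run-as-root ./build/all_reduce_perf -b 1G -e 8G -f 2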

sjeaugey commented 2 years ago

So, with 4 NICs you would get ~40 GB/s?

Did you make sure ACS was disabled? That could explain why bandwidth is halved.
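
A quick way to check and disable it on the PCIe bridges, as a sketch (the bus address below is just an example; repeat for every bridge between the GPUs and the NICs):

# Any "SrcValid+" in ACSCtl means ACS is still enabled on that bridge
sudo lspci -vvv | grep -i acsctl
# Clear the ACS control register on one bridge (replace 85:00.0 with the real address)
sudo setpci -s 85:00.0 ECAP_ACS+0x6.w=0000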

sandyhouse commented 2 years ago

Thanks for your reply. ACS is enabled on my machines; I'll disable it and retry.

sandyhouse commented 2 years ago

> So, with 4 NICs you would get ~40 GB/s?
>
> Did you make sure ACS was disabled? That could explain why bandwidth is halved.

Now I have disabled I/O virtualization and ACS, but the bus bandwidth is only 20.78 GB/s (~166 Gbps vs the 200 Gbps NIC). Is this expected? (screenshot attached)

sandyhouse commented 2 years ago

Additionally, when I use two NICs I get about 41 GB/s bus bandwidth, but with four NICs I still only get 41 GB/s. Any idea? @sjeaugey

sjeaugey commented 2 years ago

20GB/s per NIC is not uncommon on RoCE, especially if you didn't bump the MTU to 9000.
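
As a sketch of checking and bumping it (assuming ens10np0 is the netdev backing mlx5_0; with a 9000-byte interface MTU the RoCE active_mtu should then report 4096):

# Check the current interface MTU and the RoCE max_mtu/active_mtu
ip link show ens10np0
ibv_devinfo -d mlx5_0 | grep mtu
# Bump the interface MTU to 9000
sudo ip link set ens10np0 mtu 9000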

For the 4 NIC problem, can you dump the topology again, this time running with all 4 NICs and all 8 GPUs? I can only see one NIC in the previous topology.

sandyhouse commented 2 years ago

Please see the attached file. @sjeaugey topology.txt

sjeaugey commented 2 years ago

Seems like GPU Direct RDMA is missing. That's why you don't get 4x the performance but only 2x. It could also explain the less-than-ideal performance. GPU Direct was there in the first topology file though, so I guess it was just an oversight after rebooting.

In the XML topology, net tags have gdr=0 instead of gdr=1.
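
A quick check on each node, as a sketch (the module is nv_peer_mem from the legacy package, or nvidia_peermem with recent drivers):

# The GPU Direct RDMA kernel module must be loaded on every node
lsmod | grep -E 'nv_peer_mem|nvidia_peermem'
# After it is reloaded, the dumped topology should show gdr="1" on the net tags
grep gdr topology.xml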

sandyhouse commented 2 years ago

Right, the nv_peer_mem service had stopped after the reboot; with it restarted, the problem is resolved. Thanks for your reply.
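
For anyone hitting the same thing, a sketch of restarting the service and keeping it across reboots (the exact service/module name depends on which package your system uses):

# Restart the legacy nv_peer_mem service
sudo service nv_peer_mem restart
# Or, with recent drivers, load the in-tree module and make it persistent
sudo modprobe nvidia-peermem
echo nvidia-peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf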