Can you attach the topology file you generated with NCCL_TOPO_DUMP_FILE=$PWD/topology.xml?
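For reference, the dump is produced by setting that variable when launching any nccl-tests binary; the binary path and sweep flags below are illustrative, not the exact command from this report:

```bash
# Dump the topology NCCL detects to topology.xml while running a local
# 8-GPU all_reduce_perf (binary path and sweep flags are illustrative).
NCCL_TOPO_DUMP_FILE=$PWD/topology.xml NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8
```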
Sorry for the late reply. The topology file is attached. topology.txt
Note: we have four 200 Gbps NICs, but in the experiment above only one NIC was used. Performance increases linearly as the number of NICs increases.
So, with 4 NICs you would get ~40 GB/s?
Did you make sure ACS was disabled? That could explain why bandwidth is halved.
Thanks for your reply. ACS is enabled on my machines; I'll disable it and retry.
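For reference, ACS state can usually be inspected and, if needed, cleared from the OS; the bus address below is illustrative, and many platforms expose the same switch in the BIOS instead:

```bash
# Non-empty ACSCtl lines with '+' flags indicate ACS is enabled on a bridge.
sudo lspci -vvv | grep -i acsctl

# Clear ACS on one PCIe bridge (the bus address is illustrative; requires a
# pciutils version that knows the ECAP_ACS name). A BIOS toggle is the
# cleaner and persistent way to do this on most servers.
sudo setpci -s 40:01.1 ECAP_ACS+0x6.w=0000
```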
Now I have disabled I/O virtualization and ACS, but the bus bandwidth is only 20.78 GB/s (~166 Gbps vs the 200 Gbps NIC). Is this right?
Additionally, when I use two NICs I get about 41 GB/s bus bandwidth, but when I use four NICs I still only get 41 GB/s. Any idea? @sjeaugey
20GB/s per NIC is not uncommon on RoCE, especially if you didn't bump the MTU to 9000.
For the 4 NIC problem, can you dump the topology again, this time running with all 4 NICs and all 8 GPUs? I can only see one NIC in the previous topology.
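Regarding the MTU point above, a quick way to check and raise it on the RoCE interfaces (the interface name is illustrative):

```bash
# Show the current MTU of the RoCE interface (interface name is illustrative).
ip link show dev ens1f0
# Raise it to 9000; the switch ports must also accept jumbo frames, and the
# change should be made persistent through the distro's network configuration.
sudo ip link set dev ens1f0 mtu 9000
```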
Please see the attached file: topology.txt @sjeaugey
Seems like GPU Direct RDMA is missing. That's why you only get 2x the performance instead of 4x. That could also explain the less-than-ideal per-NIC performance. GPU Direct was there in the first topology file, though, so I guess it was just an oversight after rebooting.
In the XML topology, the net tags have gdr=0 instead of gdr=1.
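A quick way to spot this in the dumped file, assuming the attribute layout of the attached dumps:

```bash
# Count gdr="0" vs gdr="1" attributes in the dumped topology; gdr="1" means
# NCCL detected GPU Direct RDMA for that NIC.
grep -o 'gdr="[01]"' topology.xml | sort | uniq -c
```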
Right, the nv_peer_mem service had stopped after the reboot; the problem is now resolved. Thanks for your reply.
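For anyone hitting the same symptom, the module and service can be checked and restarted after a reboot (the service name comes from the Mellanox nvidia-peer-memory package; newer drivers ship an nvidia-peermem module instead):

```bash
# Verify the GPU Direct RDMA kernel module is loaded.
lsmod | grep nv_peer_mem
# Restart the service if it is missing (init script installed by the
# nvidia-peer-memory package; newer drivers use 'modprobe nvidia-peermem').
sudo service nv_peer_mem restart
```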
Environment:
2 nodes with A100 GPUs, intra-connected with PCIe Gen 4 and NVLink, inter-connected with four 200 Gb RoCEv2 NICs.
NCCL version: 2.11.4
CUDA version: 11.6
The result of ib_write_bw is about 180 Gb/s.
GPU Topo:
nccl-tests command:
As the result shows, the average bus bandwidth is only 9.84 GB/s (about 80 Gbps), which is far below the link bandwidth.
Any suggestions?
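The exact nccl-tests invocation was not included above; a typical 2-node, 16-GPU run over MPI looks roughly like the following, with hostnames, HCA names and paths being illustrative:

```bash
# Illustrative 2-node all_reduce_perf launch with Open MPI; hostnames, the
# HCA list and the binary path are assumptions, not the reporter's command.
mpirun -np 16 -H node1:8,node2:8 \
  -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
  -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
```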