wuyujiji opened this issue 3 years ago
Why did you set NCCL_NET_GDR_LEVEL=0? That would disable GPU Direct RDMA and negatively affect performance.
Other than that, I think we've had reports of rings being suboptimal when the NIC was close to GPUs 4-5-6-7 rather than 0-1-2-3. You could probably set CUDA_VISIBLE_DEVICES=4,5,6,7,0,1,2,3 to solve the problem, but it's not impossible NCCL chooses a suboptimal order otherwise. Now, performance should be close to 7-8 GB/s, not 1-2 GB/s; the low number is probably because GPU Direct RDMA is disabled.
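For example, a minimal way to apply that reordering (a sketch, assuming an MPI launch of nccl-tests):

```
# Remap enumeration so the GPUs reported as close to the NIC (4-7) come first
export CUDA_VISIBLE_DEVICES=4,5,6,7,0,1,2,3
# With mpirun, propagate it to every rank instead:
#   mpirun ... -x CUDA_VISIBLE_DEVICES=4,5,6,7,0,1,2,3 ...
```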
Thanks for your quick reply! The result with GDR enabled is about 8 GB/s. The reason for setting NCCL_NET_GDR_LEVEL=0 is that we have scenarios where GDR cannot be turned on. The inter-machine communication is similar to a PS architecture, so we want to use NCCL to measure the performance of the cluster's RDMA network.
When I set CUDA_VISIBLE_DEVICES=4,5,6,7,0,1,2,3, the result is still only 1.7~1.8 GB/s.
Actually, we want to know why changing the ring from 4->5->6->7->3->2->1->0 to 3->2->1->0->4->5->6->7 has such a big impact.
@sjeaugey Looking forward to your answer~
Changing the ring means flows will be in a different direction: NIC -> local memory ... distant memory -> NIC, versus NIC -> distant memory ... local memory -> NIC. I guess that may explain the performance difference. Now, this is not a case we often optimize for, hence we don't have much experience with it. We assume that if there is a PCI switch, users will want to use GDR, as it would give much higher performance (as is the case here).
And I'm not sure I get how using NCCL with GDR disabled will measure the performance of the cluster RDMA network; it seems to me the bottleneck would be on the CPU side rather than on the NIC.
Excuse me! Recently I wanted to test the back-to-back (not going through a switch) nccl-tests result, with GDR disabled, across 2 machines / 16 GPUs, and I ran into a very strange phenomenon. Details as follows:
NCCL version: v2.7.8-1
CUDA version: 10.0 or 11.0
Machine: PCIe V100
1. The topology of the two machines (both PCIe V100) is the same, as shown below:
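For reference, the PCIe topology on each node can be inspected with nvidia-smi (this only shows how to reproduce the diagram, not the diagram itself):

```
# Print the GPU/NIC interconnect matrix (PIX / PXB / PHB / SYS) on each node
nvidia-smi topo -m
```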
2. Run nccl-tests using NCCL's default ring-building method
The command is as follows:
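The exact command is not reproduced here; a sketch of the kind of invocation described (hostnames, binary path, and message sizes are assumptions; NCCL_DEBUG=INFO and NCCL_NET_GDR_LEVEL=0 are inferred from the discussion above):

```
# Assumed launch: 2 nodes x 8 GPUs each, GDR disabled, default ring building
mpirun -np 16 -H node1:8,node2:8 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_NET_GDR_LEVEL=0 \
    ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```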
The NCCL INFO channel output is:
The bandwidth is only about 1.8 GB/s, as shown below:
3. Run nccl-tests using a self-defined ring-building method
In order to verify whether this is caused by the default ring building, I customized the graph.txt and changed the NCCL INFO channel (before, the GPU order was 4->5->6->7->3->2->1->0; now it is 3->2->1->0->4->5->6->7). The command is:
graph.txt:
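A rough sketch of this step, under stated assumptions: the three extra -x exports are taken verbatim from step 4 below, while the hostnames, binary path, message sizes, and all graph attribute values (id, pattern, speeds, link types) are placeholders rather than the values actually used.

```
# Assumed graph file forcing a single ring in the order 3->2->1->0->4->5->6->7
# (graph.txt needs to exist at this path on both nodes)
cat > graph.txt <<'EOF'
<graphs version="1">
  <graph id="0" pattern="4" crossnic="0" nchannels="1"
         speedintra="12" speedinter="12" typeintra="PHB" typeinter="SYS" samechannels="1">
    <channel>
      <net dev="0"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
      <gpu dev="0"/>
      <gpu dev="4"/>
      <gpu dev="5"/>
      <gpu dev="6"/>
      <gpu dev="7"/>
      <net dev="0"/>
    </channel>
  </graph>
</graphs>
EOF

# Same assumed launch as in step 2, plus the ring/graph-file exports
mpirun -np 16 -H node1:8,node2:8 \
    -x NCCL_DEBUG=INFO -x NCCL_NET_GDR_LEVEL=0 \
    -x NCCL_ALGO=Ring -x NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH,ENV \
    -x NCCL_GRAPH_FILE=./graph.txt \
    ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```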
The NCCL channel with the self-defined method is:
However, the result goes up to about 5 GB/s, and I do not know why.
4. Re-verify the default-ring result by using the self-defined ring-building method
I wondered whether the strange phenomenon was caused by the extra parameters (-x NCCL_ALGO=Ring -x NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH,ENV -x NCCL_GRAPH_FILE=./graph.txt). Therefore, I used the self-defined method to build the default channel order, namely 4->5->6->7->3->2->1->0. However, the result is the same as before, only about 1.8 GB/s.
graph.txt:
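The graph.txt for this step would presumably differ from the sketch above only in the GPU order inside the channel, e.g.:

```
# Assumed channel section for the default order 4->5->6->7->3->2->1->0
# (the rest of the file would stay the same as in the sketch above)
cat <<'EOF'
    <channel>
      <net dev="0"/>
      <gpu dev="4"/>
      <gpu dev="5"/>
      <gpu dev="6"/>
      <gpu dev="7"/>
      <gpu dev="3"/>
      <gpu dev="2"/>
      <gpu dev="1"/>
      <gpu dev="0"/>
      <net dev="0"/>
    </channel>
EOF
```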
The channel is: