Open fyf2016 opened 1 year ago
In very old versions of NCCL, we used to have one progress thread per ring per NIC. But that was a long time ago. Now we have one progress thread per rank (GPU) because that's all we need when using RDMA APIs like IB Verbs. If needed, network plugins can implement multi-threaded progress to reach the desired performance.
@sjeaugey thanks a lot for your response.
Below are my experimental configuration and the problems I ran into. I used two machines, with identical topologies, to run the all_reduce_perf test. The topology diagram is as follows. I also specified the graph structure for the two machines, and printed the graph that was actually used: it is the same as the one I specified.

However, two problems showed up during the run:

1. The traffic sent and received by the two NICs is extremely uneven, as shown in the figure below: eth0 did not seem to transmit any data during the test. Does that mean both channels use eth1 to transmit data?
2. There are two progress threads at the beginning, but the CPU utilization of one of them suddenly drops to 0, and in the end only one progress thread runs until completion. As shown below.

To help you locate the problem, I have also printed out system.xml, as shown in the following figure:
What are the IP addresses of your two NICs? Are they in the same subnet? If so, could it be that your routing tables tell the kernel to use eth1 as the interface for that subnet? That would explain why both NICs receive data but only one sends data.
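One way to check this is to ask the kernel which route it would pick for each remote NIC. A sketch with placeholder addresses (192.0.2.11 and 192.0.2.12 stand in for the remote machine's two NICs; substitute your actual addresses):

```shell
# Which local interface would the kernel use to reach each remote NIC?
ip route get 192.0.2.11
ip route get 192.0.2.12
# List the current routes; with both NICs in one subnet, the first
# matching entry wins for all outgoing traffic to that subnet.
ip route show
```

If both `ip route get` queries report `dev eth1`, all outgoing traffic to the peer goes through eth1 regardless of which NIC NCCL opened the connection on.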
@sjeaugey thanks a lot for your follow-up response. They are in the same subnet, and the routing table has entries for both NICs. As shown below:
I speculate that there may be other reasons.
Ok, so indeed that's the reason. When the kernel wants to send data to a destination IP, it looks in the routing table and picks the first route that matches the destination. In your case that will always be eth1. You need to use different subnets, or add routing rules indicating that reaching a specific NIC on the remote node must go through a specific NIC on the local node.
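For reference, the second option can be done with Linux policy routing, which keeps both NICs in one subnet but pins outgoing traffic to the interface that owns the source address. A sketch, assuming placeholder local NIC addresses 192.0.2.10 (eth0) and 192.0.2.11 (eth1) in 192.0.2.0/24; adjust to your actual addresses, and note these commands require root:

```shell
# Policy-routing sketch (placeholder addresses; adjust to your setup).
# Give each NIC its own routing table so traffic sourced from that NIC's
# address leaves through that NIC instead of the first matching route.
ip route add 192.0.2.0/24 dev eth0 src 192.0.2.10 table 100
ip route add 192.0.2.0/24 dev eth1 src 192.0.2.11 table 101
ip rule add from 192.0.2.10 table 100
ip rule add from 192.0.2.11 table 101
```

With these rules in place, a socket bound to eth0's address sends through eth0, so both NICs carry outgoing traffic.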
I understand, thank you, all-powerful sjeaugey.
Hi Sylvain, When I was experimenting with NCCL recently, I was confused about the relationship between Progress threads and GPU topology. During network communication, will each GPU have a Progress thread? Or is the Progress thread related to the topology, for example, if there are two rings, and each ring has a network card and a GPU, then there are two Progress threads?