NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Questions about the relationship between Topology and Progress Threads #886

Open fyf2016 opened 1 year ago

fyf2016 commented 1 year ago

Hi Sylvain, while experimenting with NCCL recently, I got confused about the relationship between progress threads and GPU topology. During network communication, does each GPU get its own progress thread? Or is the number of progress threads tied to the topology: for example, if there are two rings, each with its own NIC and GPU, are there two progress threads?

sjeaugey commented 1 year ago

In very old versions of NCCL, we used to have one progress thread per ring per NIC. But that was a long time ago. Now we have one progress thread per rank (GPU) because that's all we need when using RDMA APIs like IB Verbs. If needed, network plugins can implement multi-threaded progress to reach the desired performance.
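To make the per-rank model concrete, below is a minimal conceptual sketch in Python (not NCCL source code; all names are illustrative): a single progress thread per rank repeatedly polls that rank's outstanding asynchronous network requests, in the spirit of non-blocking isend/irecv plus a test()-style completion check. The point is that the number of such threads is per rank, not per ring or per NIC.

```python
# Conceptual sketch only, NOT NCCL source: one progress thread per rank that
# polls outstanding asynchronous network requests until they complete.
# All class and function names here are illustrative.
import queue
import threading

class FakeRequest:
    """Stands in for an asynchronous isend/irecv request."""
    def __init__(self, polls_until_done=3):
        self._left = polls_until_done

    def test(self):
        """Non-blocking completion check, RDMA/IB-Verbs style."""
        self._left -= 1
        return self._left <= 0

def progress_loop(pending, stop):
    """The per-rank progress thread: pick up new requests, poll in-flight ones."""
    inflight = []
    while not stop.is_set() or inflight or not pending.empty():
        while not pending.empty():
            inflight.append(pending.get())
        # Toy busy-poll: retire every request whose test() reports completion.
        inflight = [r for r in inflight if not r.test()]

pending = queue.Queue()
stop = threading.Event()
thread = threading.Thread(target=progress_loop, args=(pending, stop))
thread.start()

# Requests issued by any number of channels/rings on this rank all feed the
# same single progress thread.
for _ in range(4):
    pending.put(FakeRequest())

stop.set()
thread.join()
```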

fyf2016 commented 1 year ago

@sjeaugey thanks a lot for your response.

Below are my experimental configuration and the problems I ran into. I used two machines with identical topology for the NCCL experiments:

[topology diagram]

I run the all_reduce_perf test across the two machines and explicitly specify the graph for both of them:

[specified graph]

I also printed the graph that was actually used, and it is identical to the graph I specified:

[graph actually used]

However, two problems show up during the run.

First, the traffic sent and received by the two NICs is extremely uneven:

[per-NIC traffic statistics]

It looks like eth0 did not transmit any data during the test. Does that mean both channels transmit their data over eth1?

Second, there are two progress threads at the beginning, but the CPU utilization of one of them suddenly drops to 0, and in the end only one progress thread keeps running until the test finishes:

[progress thread CPU utilization]

To make it easier to locate the problem, I also dumped system.xml:

[system.xml]
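For reference, a launch along these lines reproduces the setup described above. This is only a sketch: the hostnames, paths, process counts and message sizes are placeholders (it assumes an MPI launcher and nccl-tests built under ./build), with NCCL_GRAPH_FILE forcing the specified graph and NCCL_GRAPH_DUMP_FILE / NCCL_TOPO_DUMP_FILE dumping the graph actually used and system.xml.

```python
# Sketch of a launch script for the setup above; hostnames, paths, process
# counts and message sizes are placeholders, not the exact values used.
import os
import subprocess

env = dict(os.environ)
env.update({
    "NCCL_GRAPH_FILE": "/path/to/specified_graph.xml",  # graph to force
    "NCCL_GRAPH_DUMP_FILE": "/tmp/graph_used.xml",       # graph NCCL actually used
    "NCCL_TOPO_DUMP_FILE": "/tmp/system.xml",            # detected system topology
    "NCCL_DEBUG": "INFO",
})

# Two nodes, two GPUs each (placeholder counts), running nccl-tests'
# all_reduce_perf with one GPU per process.
subprocess.run(
    ["mpirun", "-np", "4", "-H", "host1:2,host2:2",
     "./build/all_reduce_perf", "-b", "8", "-e", "1G", "-f", "2", "-g", "1"],
    env=env, check=True)
```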

sjeaugey commented 1 year ago

What are the IP addresses of your two NICs? Are they in the same subnet? If so, could it be that your routing table tells the kernel to use eth1 as the egress interface for that subnet? That would explain why both NICs receive data but only one sends data.

fyf2016 commented 1 year ago

@sjeaugey thanks a lot for the follow-up. They are in the same subnet, and the routing table has entries for both NICs, as shown below:

[routing table output]

So I suspect there may be another reason.

sjeaugey commented 1 year ago

Ok, so that is indeed the reason. When the kernel wants to send data to a destination IP, it looks in the routing table and picks the first route that matches the destination, which in your case will always be eth1. You need to either put the NICs in different subnets, or add routing rules saying that to reach a specific NIC on the other node, the kernel must send through a specific NIC on the local node.
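A sketch of the second option (source-based policy routing) is below. The addresses, subnet and routing-table ids are placeholders rather than values from this issue, and the iproute2 commands require root:

```python
# Sketch of source-based (policy) routing so that traffic originating from
# eth0's address leaves via eth0 and eth1's traffic leaves via eth1.
# Addresses, subnet and table ids are placeholders; run with root privileges.
import subprocess

SUBNET = "192.168.1.0/24"              # the shared subnet (placeholder)
NICS = {
    "eth0": ("192.168.1.10", "100"),   # (local address, routing table id)
    "eth1": ("192.168.1.11", "101"),
}

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for dev, (addr, table) in NICS.items():
    # Per-NIC table: reach the subnet through this device with this source IP.
    run(["ip", "route", "add", SUBNET, "dev", dev, "src", addr, "table", table])
    # Rule: packets sourced from this address must consult that table.
    run(["ip", "rule", "add", "from", addr, "table", table])
```

The simpler alternative remains giving each NIC its own subnet, which needs no extra rules.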

fyf2016 commented 1 year ago

I see, thank you very much, sjeaugey.