NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Network Topology awareness #1340

Open liranschour opened 3 months ago

liranschour commented 3 months ago

Hi,

Looking at the code in src/graph/connect.cc, it seems that the per-node rings are connected into the global rings purely according to the order of the node list (which is defined by MPI itself). This means the code does not consider the network topology between nodes when creating the global ring, which might lead to sub-optimal rings with more hops and a higher risk of traffic-flow interference. Is this observation correct?
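To make the concern concrete, here is a toy sketch (not NCCL code) with a hypothetical hop-count matrix for four nodes split across two racks: a ring that simply follows node-list order crosses racks on every link, while a topology-aware order keeps half of the links intra-rack. The matrix, rack layout, and hop counts are invented for illustration.

```cpp
// Toy illustration: racks {node 0, node 2} and {node 1, node 3};
// 1 hop intra-rack, 3 hops inter-rack (hypothetical values).
#include <cstdio>
#include <vector>

static int ringHops(const std::vector<int>& order,
                    const std::vector<std::vector<int>>& hops) {
  int total = 0;
  for (size_t i = 0; i < order.size(); i++)
    total += hops[order[i]][order[(i + 1) % order.size()]];
  return total;
}

int main() {
  std::vector<std::vector<int>> hops = {
      {0, 3, 1, 3}, {3, 0, 3, 1}, {1, 3, 0, 3}, {3, 1, 3, 0}};
  std::vector<int> byNodeList = {0, 1, 2, 3};     // order as given by the launcher
  std::vector<int> topologyAware = {0, 2, 1, 3};  // keeps intra-rack links where possible
  printf("node-list order: %d hops, topology-aware: %d hops\n",
         ringHops(byNodeList, hops), ringHops(topologyAware, hops));  // 12 vs 8
  return 0;
}
```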

I am interested in contributing new functionality that takes the network latency / number of hops between nodes into account when connecting rings and trees.
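As a rough sketch of what that could look like (purely illustrative, not an implementation in src/graph/connect.cc): given an inter-node hop or latency matrix, a greedy nearest-neighbor pass could pick a node order before closing the global ring. Where such a matrix would come from (measurement, fabric manager, user-supplied file) is left open and simply assumed here.

```cpp
// Hedged sketch: greedy nearest-neighbor ordering over a hypothetical
// inter-node cost matrix. Real topology-aware ring construction would also
// need tie-breaking, tree handling, and a source for the cost data.
#include <cstdio>
#include <vector>

std::vector<int> orderNodesGreedy(const std::vector<std::vector<int>>& cost) {
  int n = (int)cost.size();
  std::vector<bool> used(n, false);
  std::vector<int> order;
  int cur = 0;                  // start from node 0
  order.push_back(cur);
  used[cur] = true;
  for (int step = 1; step < n; step++) {
    int best = -1;
    for (int j = 0; j < n; j++)
      if (!used[j] && (best < 0 || cost[cur][j] < cost[cur][best])) best = j;
    order.push_back(best);
    used[best] = true;
    cur = best;
  }
  return order;                 // visit order; the ring closes back to order[0]
}

int main() {
  // Same hypothetical two-rack layout as above.
  std::vector<std::vector<int>> hops = {
      {0, 3, 1, 3}, {3, 0, 3, 1}, {1, 3, 0, 3}, {3, 1, 3, 0}};
  for (int n : orderNodesGreedy(hops)) printf("%d ", n);  // expected: 0 2 1 3
  printf("\n");
  return 0;
}
```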

Is anybody else active in this area?

Thanks,

jiangzhuti commented 3 months ago

I am interested in this. Given the complexity of network topologies, it is difficult to design a universal configuration method. In practice, users can solve this problem themselves at the node-scheduling and node-placement level, in whatever way fits their own systems. I am also worried that doing this inside NCCL would make communication harder to debug, because the communication order would become harder to determine.
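As an example of handling this above NCCL: assuming the inter-node ring follows the rank order NCCL is given (as described in the issue), a user can renumber MPI ranks by a topology key before creating the NCCL communicator, so that nodes on the same rack receive consecutive ranks. The rackId() helper below is hypothetical and would come from the cluster's own hostname scheme or scheduler information.

```cpp
// Hedged sketch: reorder MPI ranks by a topology key before NCCL init,
// so the node order NCCL sees is topology-friendly.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <unistd.h>

// Hypothetical helper: derive a rack/leaf-switch id for this node. Shown as a
// trivial placeholder; a real job would parse the hostname or scheduler info.
static int rackId() {
  char host[256] = {0};
  gethostname(host, sizeof(host) - 1);
  return host[0];  // placeholder only, not a meaningful mapping
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int worldRank, worldSize;
  MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
  MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

  // One communicator (same color); key sorts ranks by rack, then original rank.
  MPI_Comm ordered;
  MPI_Comm_split(MPI_COMM_WORLD, /*color=*/0,
                 /*key=*/rackId() * worldSize + worldRank, &ordered);
  int rank, size;
  MPI_Comm_rank(ordered, &rank);
  MPI_Comm_size(ordered, &size);

  cudaSetDevice(0);  // assumes one visible GPU per process; adjust for multi-GPU nodes

  // Initialize NCCL with the reordered rank.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, ordered);
  ncclComm_t comm;
  ncclCommInitRank(&comm, size, id, rank);

  /* ... collectives ... */

  ncclCommDestroy(comm);
  MPI_Comm_free(&ordered);
  MPI_Finalize();
  return 0;
}
```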

sjeaugey commented 3 months ago

I'm not pushing for network topology detection either, as it would greatly complicate the code. The current approach is based on simplicity.

That strategy works great, is universal, and doesn't require complex integration in the job scheduler / launcher / communication library / network fabric.