NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Selection of network devices in nccl #915

Open clearsky07 opened 1 year ago

clearsky07 commented 1 year ago

I want to know why network devices are chosen this way in ncclTopoGetNetDev (nccl/src/graph/search.cc):

    // Honor the net device in the graph
    int channel = channelId%graph->nChannels;
    int ngpus = comm->topo->nodes[GPU].count;
    int index = graph->intra[channel*ngpus] == rank ? 0 : 1;
    *dev = graph->inter[(channel*2+index)%ngpus];

What is the meaning of index, graph->intra, and graph->inter? Thanks a lot.

sjeaugey commented 1 year ago

graph->intra is the list of GPUs, graph->inter is the list of NICs (NIC to enter the node, NIC to exit the node).

So basically the flow for a ring would be NIC inter[0], GPU intra[0] .. GPU intra[n-1], NIC inter[1].