NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Questions on Tree Structures #448

Open szhengac opened 3 years ago

szhengac commented 3 years ago

Hi, can anyone elaborate more on the difference between the following three Tree structures?

https://github.com/NVIDIA/nccl/blob/399656269027c1818fc999ccf8ec4dd838cec50d/src/include/graph.h#L55-L57

Also, what does the following constant stand for

https://github.com/NVIDIA/nccl/blob/399656269027c1818fc999ccf8ec4dd838cec50d/src/include/graph.h#L50

Based on the usage of graph->intra elsewhere in the codebase, I thought it was the maximum number of GPUs. But 256 is too small for that, so I am confused.

Thanks.

sjeaugey commented 3 years ago

The different types of trees are with respect to how we connect intra-node ranks to the inter-node tree. The inter-node tree is always the same. Intra-node, when we have 2 GPUs close to the NIC, we can choose which GPU will send or receive, so we have different options to balance PCI traffic and/or reduction computing load.

The NCCL_TOPO_MAX_NODE constant is the maximum number of nodes of one type in the node topology graph. So we support up to 256 GPUs (per node), 256 NICs (per node), 256 PCI switches, 256 NUMA nodes, ...

szhengac commented 3 years ago

Thanks for responding. Is the double binary tree mentioned in https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/ equivalent to any of the tree types here?

sjeaugey commented 3 years ago

The double binary tree concerns inter-node communication only. All three use it.

szhengac commented 3 years ago

Then how do we aggregate the data across the intra-node ranks? With another intra-node tree, or a ring allreduce?

sjeaugey commented 3 years ago

Well, we also have an intra-node chain which converges to the NIC, adding a third branch to the tree (not shown in the double tree). How that chain is connected to the inter-node double tree is what makes the three variants.

szhengac commented 3 years ago

Thanks. This is much clearer to me. One last question: do we have an additional dummy root node (e.g., rank 0 in the first tree of https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/ ) in the double binary tree when the number of nodes is odd? If not, it seems to me that busbw = 1.5 * algbw.