NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Question about tree channel #1491

Open networkResearcher opened 1 month ago

networkResearcher commented 1 month ago

Our test environment has 32 servers, each with 4 GPUs and 4 NICs, and NVLink connecting the GPUs within each server. In the all-reduce logs, we see that 16 communication channels are created in total. Channel n (n = 0, 1, 2, 3, 4, 5, 6, 7) and channel n+7 form a double binary tree pair. In addition, channel n (n = 0, 1, 2, 3) and channel n+4 are identical. Could you please explain why two identical channels are set up this way?
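To make "identical" concrete, here is a minimal sketch of how such a comparison could be done, assuming the per-channel tree description has already been extracted from the logs into a mapping from channel index to a string. The function name `group_identical_channels` and the placeholder strings are hypothetical; they are not NCCL output or part of any NCCL API.

```python
from collections import defaultdict


def group_identical_channels(trees):
    """Group channel indices whose tree descriptions are exactly equal.

    `trees` maps a channel index to whatever string describes that
    channel's tree for one rank (an assumption of this sketch).
    """
    groups = defaultdict(list)
    # Visit channels in index order so each group is sorted.
    for channel, description in sorted(trees.items()):
        groups[description].append(channel)
    # Keep only descriptions shared by more than one channel.
    return [channels for channels in groups.values() if len(channels) > 1]


# Hypothetical placeholder data, NOT real NCCL log content:
# channels 0-3 and 4-7 are given the same four descriptions.
example = {0: "A", 1: "B", 2: "C", 3: "D", 4: "A", 5: "B", 6: "C", 7: "D"}
duplicates = group_identical_channels(example)
print(duplicates)
```

With the placeholder data above, the result pairs each channel n with channel n+4, which is the pattern described in the question.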