NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Why does NVLSTree Allreduce perform worse than Ring Allreduce? #1362

Open MoringKing opened 3 months ago

MoringKing commented 3 months ago

When the total number of GPUs is large and the message size is 1, an NCCL Allreduce can take as long as 300 ms if the NVLSTree algorithm is specified, but only 8 ms if the Ring algorithm is specified. Why? Does this mean that NVLSTree should not be chosen for Allreduce when there are many GPUs and the messages are small?

cyqmonkey commented 3 months ago

For small message sizes, the Ring algorithm with the LL (low-latency) protocol has lower latency than NVLSTree.
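
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two configurations could be set up. It relies on the `NCCL_ALGO` and `NCCL_PROTO` environment variables and on the `all_reduce_perf` benchmark from nccl-tests. The binary path, the size range, the GPU count, and the helper names (`nccl_env`, `perf_command`, `build_runs`) are assumptions for illustration, not taken from this thread. A multi-node run would also need a launcher such as `mpirun`, which is not shown.

```python
import os
import shlex

# Assumed location of the nccl-tests binary; adjust for your build.
ALL_REDUCE_PERF = "./build/all_reduce_perf"


def nccl_env(algo, proto=None):
    """Copy the current environment and pin the NCCL algorithm (and optionally the protocol)."""
    env = dict(os.environ)
    env["NCCL_ALGO"] = algo
    if proto is not None:
        env["NCCL_PROTO"] = proto
    else:
        # Let NCCL choose the protocol itself for this run.
        env.pop("NCCL_PROTO", None)
    return env


def perf_command(min_bytes="8", max_bytes="1M", gpus=8):
    """Build an all_reduce_perf command line sweeping small message sizes.

    -b / -e are the minimum / maximum sizes, -f the multiplication factor
    between sizes, -g the number of GPUs per thread. The values are assumptions.
    """
    return [ALL_REDUCE_PERF, "-b", min_bytes, "-e", max_bytes, "-f", "2", "-g", str(gpus)]


def build_runs():
    """Return (label, env, command) for the two configurations compared in this issue."""
    cmd = perf_command()
    return [
        ("NVLSTree", nccl_env("NVLSTree"), cmd),
        ("Ring+LL", nccl_env("Ring", "LL"), cmd),
    ]


if __name__ == "__main__":
    # Only print what would be run; nothing is launched here.
    for label, env, cmd in build_runs():
        proto = env.get("NCCL_PROTO", "<default>")
        print(f"{label}: NCCL_ALGO={env['NCCL_ALGO']} NCCL_PROTO={proto} {shlex.join(cmd)}")
```

To actually launch a run, each `(env, cmd)` pair can be passed to `subprocess.run(cmd, env=env)`. Comparing the reported latency at the smallest sizes should show whether the gap described above appears on a given system.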