With a large total number of GPUs and a message size of 1, NCCL Allreduce can take as long as 300 ms when the NVLSTree algorithm is specified, but only 8 ms when the Ring algorithm is specified. Why is the gap so large? Does this mean NVLSTree should not be chosen for Allreduce when there are many GPUs and the message is small?
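For anyone trying to reproduce the comparison, here is a minimal sketch of how the two runs could be set up. It assumes the algorithm is being forced through the `NCCL_ALGO` environment variable and that the timing comes from the `all_reduce_perf` binary in nccl-tests. The binary path, the 4-byte default size, and the `build_run` helper are hypothetical, not taken from the question. The script only prints the commands; on a multi-node job they would still need to be wrapped in the usual launcher.

```python
import os
import shlex

# Hypothetical path: adjust to wherever nccl-tests was built.
ALL_REDUCE_PERF = "./build/all_reduce_perf"


def build_run(algo, nbytes=4, gpus_per_proc=1):
    """Return (env, argv) for one all_reduce_perf run pinned to one algorithm.

    nbytes=4 is an assumption (one 32-bit element); the question does not
    say whether "message size 1" means one byte or one element.
    """
    env = dict(os.environ)
    env["NCCL_ALGO"] = algo              # force the algorithm under test
    env["NCCL_DEBUG"] = "INFO"           # log what NCCL actually selected
    env["NCCL_DEBUG_SUBSYS"] = "TUNING"  # restrict the log to tuning decisions
    # -b / -e set the minimum and maximum message size, -g the GPUs per process.
    argv = [ALL_REDUCE_PERF, "-b", str(nbytes), "-e", str(nbytes),
            "-g", str(gpus_per_proc)]
    return env, argv


if __name__ == "__main__":
    for algo in ("NVLSTree", "Ring"):
        env, argv = build_run(algo)
        print(f"NCCL_ALGO={env['NCCL_ALGO']} " + shlex.join(argv))
```

Checking the tuning lines in the debug log for each run helps confirm that the requested algorithm was really used, rather than NCCL falling back to another one.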