CraneQinghe opened this issue 4 months ago
This is caused by the last operation in the ring allreduce (buff -> output): https://github.com/NVIDIA/nccl/blob/master/src/device/all_reduce.h#L78. That operation is a device-to-device copy (not a p2p write), and even though it can be overlapped with the next send (which is P2P), it still costs something. The last copy has a larger relative impact when there are fewer GPUs, which is why the peak performance of a 2-GPU ring allreduce can be lower than the peak performance of 8 GPUs.
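To make the effect concrete, here is a toy cost model, not NCCL's actual cost model: the function name `ring_allreduce_time`, the bandwidth numbers, and the assumption that one chunk-sized copy stays exposed are all illustrative placeholders.

```python
# Toy model: a ring allreduce moves 2*(N-1)/N of the buffer per GPU over the
# links, plus a final local buff -> output copy that is not a P2P write.
def ring_allreduce_time(nbytes, n_gpus, link_bw, dtod_bw, copy_overlap=0.0):
    """Estimate time in seconds. link_bw / dtod_bw are in bytes/s.
    copy_overlap is the fraction of the final copy hidden behind the last
    send (an assumed knob, not a measured value)."""
    p2p_time = 2.0 * (n_gpus - 1) / n_gpus * nbytes / link_bw
    # Assume roughly one chunk (nbytes / n_gpus) of dtod copy stays exposed.
    copy_time = (nbytes / n_gpus) / dtod_bw * (1.0 - copy_overlap)
    return p2p_time + copy_time

# The exposed copy shrinks relative to the P2P traffic as N grows, so its
# relative cost is largest at N = 2.
for n in (2, 4, 8):
    t = ring_allreduce_time(256 * 2**20, n, link_bw=20e9, dtod_bw=600e9)
    print(f"N={n}: ~{t * 1e3:.2f} ms")
```

In this sketch the P2P term grows with (N-1)/N while the copy term shrinks with 1/N, so the copy's share of the total is largest on 2 GPUs, matching the observation above.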
I'm studying the relationship between a deep-learning job's communication time and the number of GPUs N, and I use the PyTorch profiler to record the job's execution trace. I find that the duration of the NCCL ring allreduce stream does not scale with the theoretical (N-1)/N factor; it grows more slowly than (N-1)/N, and I don't understand why. Am I measuring the communication time the wrong way? Is the NCCL ring allreduce stream duration not equal to the communication time? I suspect the NCCL ring allreduce stream also contains the time spent waiting for backward propagation to fill the bucket. Can anyone give me an answer? Thank you.
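For comparison, one way to rule out bucket-fill waiting is to time a bare `all_reduce` outside DDP with CUDA events. A minimal sketch (tensor size, iteration counts, and the launch command are illustrative assumptions):

```python
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", rank)))

    # 64M fp32 elements = 256 MB, a standalone buffer with no gradient buckets.
    tensor = torch.randn(64 * 1024 * 1024, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    for _ in range(5):                 # warm up and establish NCCL channels
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    start.record()
    for _ in range(20):
        dist.all_reduce(tensor)
    end.record()
    torch.cuda.synchronize()

    if rank == 0:
        ms = start.elapsed_time(end) / 20
        print(f"world_size={dist.get_world_size()} avg all_reduce: {ms:.3f} ms")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. torchrun --nproc_per_node=<N> this_script.py
```

Since there is no backward pass here, the measured time is pure communication; if this scales closer to (N-1)/N than the profiler's stream duration does, the difference is likely the time DDP spends waiting for gradients to fill each bucket.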