NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Why doesn't the nccl ring all-reduce stream duration scale with the theoretical (N-1)/N? #1282

Open CraneQinghe opened 4 months ago

CraneQinghe commented 4 months ago

I'm studying the relation between a deep-learning job's communication time and the number of GPUs N, using the PyTorch profiler to capture the job's running trace. I find that the job's nccl ring all-reduce stream duration doesn't scale with the theoretical (N-1)/N: its growth is lower than the theoretical (N-1)/N, and I don't know what is going on. Am I measuring communication time the wrong way? Is the nccl ring all-reduce stream duration not equal to the communication time? I suspect the nccl ring all-reduce stream may include the time spent waiting for backward propagation to fill the bucket. Can anyone give me an answer? Thank you.
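One way to rule out the bucket-fill question is to time an isolated `all_reduce` outside of any training loop, so no backward-propagation wait can leak into the measurement. Below is a minimal sketch, assuming a single node with N CUDA GPUs, NCCL available, and launch via `torchrun`; the tensor size, warm-up count, and iteration count are arbitrary choices, not anything NCCL prescribes.

```python
# Time an isolated all_reduce with CUDA events so the measurement cannot
# include time spent waiting for backward propagation to fill a DDP bucket.
# Launch with: torchrun --nproc_per_node=<N> bench_allreduce.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 256 MB of fp32 data, standing in for one gradient bucket (size is an assumption).
    x = torch.randn(64 * 1024 * 1024, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warm up so NCCL communicator/channel setup is not counted.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start.record()
    for _ in range(iters):
        dist.all_reduce(x)
    end.record()
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        ms = start.elapsed_time(end) / iters
        print(f"all_reduce of {x.numel() * 4 / 2**20:.0f} MB: {ms:.3f} ms/iter")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If the per-iteration time from this standalone benchmark scales closer to (N-1)/N than the stream durations in your training trace, the gap is coming from what else sits on that stream during training rather than from the collective itself.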

visualxu commented 3 months ago

This is caused by the last operation in the ring allreduce (buff -> output): https://github.com/NVIDIA/nccl/blob/master/src/device/all_reduce.h#L78. That operation is a device-to-device copy (not a p2p write), even though it can be overlapped with the next send (which is P2P). The last copy has a greater impact when there are fewer GPUs, so you may see that the peak performance of ring allreduce on 2 GPUs is lower than on 8 GPUs.
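A back-of-the-envelope sketch of why the measured time grows more slowly than (N-1)/N: each of the 2(N-1) ring steps carries some fixed per-step overhead, and the final buff->output copy adds a term that does not shrink as N grows, so the constant terms weigh relatively more at small N. This is not NCCL's actual performance model; the bandwidth and overhead numbers below are placeholder assumptions for illustration only.

```python
# Toy cost model for one ring all-reduce (all parameter values are assumptions).
def ring_allreduce_time(size_bytes, n_gpus,
                        link_bw=25e9,        # assumed per-link bandwidth, bytes/s
                        step_overhead=5e-6,  # assumed fixed overhead per ring step, s
                        copy_bw=200e9):      # assumed dtod copy bandwidth, bytes/s
    steps = 2 * (n_gpus - 1)            # reduce-scatter + all-gather steps
    chunk = size_bytes / n_gpus         # data moved per step
    transfer = steps * chunk / link_bw  # = 2*(N-1)/N * size / link_bw
    latency = steps * step_overhead     # per-step launch/sync overhead
    final_copy = size_bytes / copy_bw   # last buff -> output copy, independent of N
    return transfer + latency + final_copy

size = 256 * 2**20
base = ring_allreduce_time(size, 2)
for n in (2, 4, 8):
    t = ring_allreduce_time(size, n)
    # Measured growth vs. the ideal (N-1)/N growth relative to 2 GPUs.
    print(f"N={n}: est {t*1e3:.2f} ms, "
          f"growth vs 2 GPUs {t/base:.2f}x, ideal {((n-1)/n) / 0.5:.2f}x")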