CraneQinghe opened this issue 4 months ago
This is caused by the last operation in the ring allreduce (buff -> output): https://github.com/NVIDIA/nccl/blob/master/src/device/all_reduce.h#L78. That operation is a device-to-device copy (not a p2p write), and even though it can be overlapped with the next send (which is P2P), it still costs something. The last copy has a larger relative impact when there are fewer GPUs, which is why the peak performance of a 2-GPU ring allreduce can be lower than the peak performance of 8 GPUs.
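To make the effect concrete, here is a toy cost model, not NCCL's actual cost model: the function name `ring_allreduce_time`, the bandwidth numbers, and the assumption that one chunk-sized copy stays exposed are all illustrative placeholders.

```python
# Toy model: a ring allreduce moves 2*(N-1)/N of the buffer per GPU over the
# links, plus a final local buff -> output copy that is not a P2P write.
def ring_allreduce_time(nbytes, n_gpus, link_bw, dtod_bw, copy_overlap=0.0):
    """Estimate time in seconds. link_bw / dtod_bw are in bytes/s.
    copy_overlap is the fraction of the final copy hidden behind the last
    send (an assumed knob, not a measured value)."""
    p2p_time = 2.0 * (n_gpus - 1) / n_gpus * nbytes / link_bw
    # Assume roughly one chunk (nbytes / n_gpus) of dtod copy stays exposed.
    copy_time = (nbytes / n_gpus) / dtod_bw * (1.0 - copy_overlap)
    return p2p_time + copy_time

# The exposed copy shrinks relative to the P2P traffic as N grows, so its
# relative cost is largest at N = 2.
for n in (2, 4, 8):
    t = ring_allreduce_time(256 * 2**20, n, link_bw=20e9, dtod_bw=600e9)
    print(f"N={n}: ~{t * 1e3:.2f} ms")
```

In this sketch the P2P term grows with (N-1)/N while the copy term shrinks with 1/N, so the copy's share of the total is largest on 2 GPUs, matching the observation above.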
I'm studying the relationship between a deep-learning job's communication time and the number of GPUs N, and I use the PyTorch profiler to record the job's execution trace. I find that the duration of the NCCL ring allreduce stream does not scale with the theoretical (N-1)/N factor; it grows more slowly than (N-1)/N, and I don't understand why. Am I measuring the communication time the wrong way? Is the NCCL ring allreduce stream duration not equal to the communication time? I suspect the NCCL ring allreduce stream also contains the time spent waiting for backward propagation to fill the bucket. Can anyone give me an answer? Thank you.
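For comparison, one way to rule out bucket-fill waiting is to time a bare `all_reduce` outside DDP with CUDA events. A minimal sketch (tensor size, iteration counts, and the launch command are illustrative assumptions):

```python
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", rank)))

    # 64M fp32 elements = 256 MB, a standalone buffer with no gradient buckets.
    tensor = torch.randn(64 * 1024 * 1024, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    for _ in range(5):                 # warm up and establish NCCL channels
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    start.record()
    for _ in range(20):
        dist.all_reduce(tensor)
    end.record()
    torch.cuda.synchronize()

    if rank == 0:
        ms = start.elapsed_time(end) / 20
        print(f"world_size={dist.get_world_size()} avg all_reduce: {ms:.3f} ms")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g. torchrun --nproc_per_node=<N> this_script.py
```

Since there is no backward pass here, the measured time is pure communication; if this scales closer to (N-1)/N than the profiler's stream duration does, the difference is likely the time DDP spends waiting for gradients to fill each bucket.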