NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Why duplicate nChannels in connect.cc #1302

Open jxh314 opened 1 month ago

jxh314 commented 1 month ago

Hello, while reading the source code of NCCL v2.14.3, I found that all nChannels channels are duplicated: https://github.com/NVIDIA/nccl/blob/c4e2aa6c792b4e94d0343c72ce20e71285238827/src/graph/connect.cc#L65-L68 What is the purpose of doing this? Issue #578 gave me some understanding, but I am still not clear about the meaning of 'bubble'. Could you explain it a bit?

visualxu commented 4 weeks ago

For example, a single channel (driven by a single GPU SM) reaches about 24 GB/s, while a NIC runs at 400 Gbps (50 GB/s). This mismatch is what is called a bubble. If different rings are used to match the NIC speed, it may lead to performance degradation.
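
To make the arithmetic behind those figures concrete, here is a minimal sketch. It is not NCCL code: the helper names `gbpsToGBs` and `channelsToSaturate` are hypothetical, the 24 GB/s and 400 Gbps values are just the example numbers quoted above, and it assumes per-channel bandwidth adds up linearly, which real hardware may not do.

```cpp
#include <cmath>

// Hypothetical helper: convert a link rate from gigabits per second
// to gigabytes per second (8 bits per byte).
constexpr double gbpsToGBs(double gbps) { return gbps / 8.0; }

// Hypothetical helper: smallest number of channels whose combined rate
// reaches the NIC rate, ASSUMING bandwidth scales linearly with channels.
int channelsToSaturate(double nicGBs, double perChannelGBs) {
  return static_cast<int>(std::ceil(nicGBs / perChannelGBs));
}

// Fraction of the NIC left idle when only one channel feeds it.
double idleFraction(double nicGBs, double perChannelGBs) {
  return 1.0 - perChannelGBs / nicGBs;
}
```

With the numbers from the comment, `gbpsToGBs(400.0)` gives 50 GB/s, one 24 GB/s channel leaves roughly half of the NIC unused, and `channelsToSaturate(50.0, 24.0)` returns 3 under the linear-scaling assumption.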