NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Do NCCL kernels participating in the same collective synchronize their termination? #1337

Open taekyounghan opened 5 months ago

taekyounghan commented 5 months ago

Hi all

I was wondering what the termination conditions are for NCCL kernels such as AllReduce, AllGather, and ReduceScatter.

For AllReduce, it seems there are multiple phases: Send, RecvReduceSend, RecvReduceCopySend, RecvCopySend, and RecvCopy.
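To make the phase order concrete, here is a minimal Python sketch of a ring AllReduce as I understand it. It is a toy model, not NCCL's code: the function name `ring_allreduce`, the one-value-per-chunk layout, and the chunk indexing are my own assumptions, and the lockstep loop is only a simulation convenience that says nothing about how the real kernels synchronize or terminate.

```python
def ring_allreduce(inputs):
    """Toy ring AllReduce (sum). inputs[r][c] is rank r's value for chunk c.

    Assumes n >= 2 ranks and exactly n chunks per rank (hypothetical layout).
    Returns (outputs, log) where log[r] lists the phases rank r went through.
    """
    n = len(inputs)
    if n < 2:
        raise ValueError("this sketch assumes at least two ranks")
    out = [[None] * n for _ in range(n)]
    inbox = [None] * n              # inbox[r]: value in flight to rank r
    log = [[] for _ in range(n)]

    for step in range(2 * n - 1):   # each rank runs 2n-1 primitives
        outbox = [None] * n
        for r in range(n):
            nxt = (r + 1) % n       # ring neighbor that rank r sends to
            if step == 0:
                # Send: push the first chunk to the next rank.
                c = (r + n - 1) % n
                outbox[nxt] = inputs[r][c]
                phase = "Send"
            elif step < n - 1:
                # RecvReduceSend: add the local value, forward the partial sum.
                c = (r + n - (step + 1)) % n
                outbox[nxt] = inbox[r] + inputs[r][c]
                phase = "RecvReduceSend"
            elif step == n - 1:
                # RecvReduceCopySend: the sum for chunk r is now complete.
                c = r
                out[r][c] = inbox[r] + inputs[r][c]
                outbox[nxt] = out[r][c]
                phase = "RecvReduceCopySend"
            elif step < 2 * n - 2:
                # RecvCopySend: store a finished chunk and forward it.
                c = (r + n - (step - (n - 1))) % n
                out[r][c] = inbox[r]
                outbox[nxt] = inbox[r]
                phase = "RecvCopySend"
            else:
                # RecvCopy: store the last finished chunk; nothing left to send.
                c = (r + 1) % n
                out[r][c] = inbox[r]
                phase = "RecvCopy"
            log[r].append(phase)
        inbox = outbox              # values sent now arrive at the next step
    return out, log


outputs, log = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(outputs[0], log[0])
```

In this model each rank's last primitive is a receive with no matching send, which is the point at which my question about independent termination applies.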

My question is: do NCCL kernels participating in the same collective synchronize their termination?

If not, once the required receive, send, and reduce operations are completed on one node, does the kernel on that node exit independently of the NCCL kernels running on the other nodes?

Wouldn't it waste GPU resources if an NCCL kernel that has completed all of its necessary operations still had to wait for the kernels on the other nodes to terminate?

Best regards,
Taekyoung