Hi all
I was wondering what the termination conditions are for NCCL kernels such as AllReduce, AllGather, and ReduceScatter.
For AllReduce, it seems there are multiple phases: Send, RecvReduceSend, RecvReduceCopySend, RecvCopySend, and RecvCopy.
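To make the phases concrete, here is a small Python simulation of how I understand the ring schedule for AllReduce. It is only a sketch based on my reading of the phase names, not NCCL's actual kernel code: the function name `ring_allreduce_sim`, the one-number-per-chunk layout, and the lockstep stepping are my own simplifications (the real kernels run asynchronously on each GPU).

```python
def ring_allreduce_sim(inputs):
    """Simulate a sum AllReduce over a ring, one number per chunk.

    inputs[r][c] is rank r's value for chunk c; there must be as many
    chunks as ranks. Returns (outputs, phase_log). Hypothetical helper,
    not part of NCCL.
    """
    n = len(inputs)
    if n < 2 or any(len(row) != n for row in inputs):
        raise ValueError("need n >= 2 ranks with n chunks each")

    outputs = [[None] * n for _ in range(n)]
    phase_log = [[] for _ in range(n)]

    # wire[r] holds the value rank r sent to rank (r + 1) % n in the last step.
    # Phase 1 (Send): each rank pushes one of its own chunks to the next rank.
    wire = [inputs[r][(r - 1) % n] for r in range(n)]
    for r in range(n):
        phase_log[r].append("Send")

    # Phase 2 (RecvReduceSend, n - 2 times): add the local chunk, forward it.
    for j in range(2, n):
        wire = [wire[(r - 1) % n] + inputs[r][(r - j) % n] for r in range(n)]
        for r in range(n):
            phase_log[r].append("RecvReduceSend")

    # Phase 3 (RecvReduceCopySend): the last reduction yields the final value
    # of chunk r on rank r; store it locally and forward it.
    wire = [wire[(r - 1) % n] + inputs[r][r] for r in range(n)]
    for r in range(n):
        outputs[r][r] = wire[r]
        phase_log[r].append("RecvReduceCopySend")

    # Phase 4 (RecvCopySend, n - 2 times): store a finished chunk, forward it.
    for j in range(1, n - 1):
        wire = [wire[(r - 1) % n] for r in range(n)]
        for r in range(n):
            outputs[r][(r - j) % n] = wire[r]
            phase_log[r].append("RecvCopySend")

    # Phase 5 (RecvCopy): store the last finished chunk; nothing is sent.
    for r in range(n):
        outputs[r][(r + 1) % n] = wire[(r - 1) % n]
        phase_log[r].append("RecvCopy")

    return outputs, phase_log
```

In this sketch every rank runs the same 2n - 1 phases, and each rank's final RecvCopy depends only on the last send from its predecessor in the ring, not on any global signal.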
My question is: do the NCCL kernels participating in the same collective synchronize their termination?
If not, once a kernel has completed its required receive, send, and reduce operations on one node, does it exit independently of the NCCL kernels running on the other nodes?
If the kernels do synchronize, wouldn't it waste GPU resources for a kernel that has completed all of its operations to keep waiting for the other nodes to finish?
Best regards,
Taekyoung