osayamenja opened 3 months ago
@AddyLaddy Please let me know your thoughts.
Using ncclGroupStart/ncclGroupEnd is key here. When multiple calls are wrapped inside a group, they are fused together logically, and you can create inter-GPU communication patterns that proceed concurrently.
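For concreteness, here is a minimal sketch of that pattern (close to the point-to-point examples in the NCCL documentation): an AlltoAll built from grouped `ncclSend`/`ncclRecv`. The communicator, stream, rank count, and device buffers are assumed to be set up elsewhere; error checking is omitted.

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* Sketch of AlltoAll from grouped send/recv. `comm`, `stream`,
 * `nranks`, and the device buffers are assumed to be initialized
 * elsewhere. Inside the group, the calls only *post* work; the
 * whole set is fused and launched at ncclGroupEnd(), so no single
 * send can block waiting for a recv that has not been issued yet. */
void allToAll(const float* sendbuff, float* recvbuff, size_t count,
              ncclComm_t comm, cudaStream_t stream, int nranks) {
    ncclGroupStart();
    for (int peer = 0; peer < nranks; peer++) {
        /* Each rank exchanges a `count`-element slice with every peer. */
        ncclSend(sendbuff + (size_t)peer * count, count, ncclFloat, peer, comm, stream);
        ncclRecv(recvbuff + (size_t)peer * count, count, ncclFloat, peer, comm, stream);
    }
    ncclGroupEnd(); /* all sends/recvs launch together here */
}
```

Because the group is launched as one fused operation, the ordering of the `ncclSend` and `ncclRecv` calls within the loop does not matter for correctness.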
@sjeaugey Thank you! So in that case, are the send/recv calls no longer blocking? Another question: can you explain precisely what happens as NCCL executes the fused All-to-All operation?
The fused call of sends and recvs is still blocking as a whole, but the individual sends and receives are not blocking within the global fused operation.
Thanks again. When you say the fused call is blocking, do you mean that the operation does not complete until all of its send/recv calls complete? If so, how does NCCL keep track of which ones have completed?
As stated here, `ncclSend` and `ncclRecv` are "blocking" to the GPU and CPU, it seems. Given these semantics, I cannot wrap my head around how AlltoAll, as implemented in the linked documentation or here, works without deadlock. Obviously, this implementation works fine in practice, hence my question. For clarity, I will explain my mental model of this operation, and hopefully someone can point out where it goes wrong.

Consider AlltoAll[^1] across three GPUs 0, 1 and 2[^2]:

1. GPUs 0, 1 and 2 each begin by issuing `ncclSend` to GPU 1, GPU 0 and GPU 0, respectively.
2. Each of these sends blocks until the matching `ncclRecv` is issued by GPU 1 for GPU 0, GPU 0 for GPU 1 and GPU 0 for GPU 2.
3. However, every GPU is still blocked on its own send, so GPU 1 can never issue its `recv` from GPU 0 and vice versa, and the operation should deadlock.

Please help me clarify!
[^1]: Let's also assume the concurrent implementation across channels as described in GTC '24 S61368.
[^2]: For simplicity, let's not consider the iteration where a GPU sends/receives to and from itself.