P2P Blocking Semantics and AlltoAll

osayamenja commented 3 months ago

As stated here, ncclSend and ncclRecv appear to be "blocking" for both the GPU and the CPU. Given these semantics, I cannot wrap my head around how AlltoAll, as implemented in the linked documentation or here, works without deadlock. Obviously, this implementation works fine in practice, hence my question.

For clarity, I will explain my mental model of this operation and hopefully someone can point out where it goes wrong.

Consider GPUs 0, 1, and 2 performing AlltoAll.[^1]
Per the loop, the first[^2] iteration entails GPU 0, GPU 1, and GPU 2 simultaneously issuing an ncclSend to GPU 1, GPU 0, and GPU 0, respectively.
Each calling thread/stream then blocks until the matching ncclRecv is issued: by GPU 1 for GPU 0, by GPU 0 for GPU 1, and by GPU 0 for GPU 2.
Naively, this yields a deadlock between GPU 0 and GPU 1: each is blocked in its send, waiting for a recv that the other never gets to issue.
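
The mental model above can be written down as a small simulation. This is a toy sketch in Python, not NCCL code: `alltoall_program` and `run_sequential` are hypothetical names, and the rule "a send completes only when the peer's current operation is the matching recv" is the assumption being tested, not a statement about how NCCL behaves.

```python
from collections import deque

def alltoall_program(rank, nranks):
    """Ops one rank issues in loop order, skipping itself: send(r), then recv(r)."""
    ops = []
    for peer in range(nranks):
        if peer == rank:
            continue
        ops.append(("send", peer))
        ops.append(("recv", peer))
    return ops

def run_sequential(programs):
    """Assumed model: every op blocks, so a rank is stuck on its first pending op
    until the peer's first pending op is the matching counterpart.
    Returns True if every rank finishes, False if the ranks deadlock."""
    queues = [deque(p) for p in programs]
    progressed = True
    while progressed:
        progressed = False
        for rank, q in enumerate(queues):
            if not q:
                continue
            kind, peer = q[0]
            if kind != "send" or not queues[peer]:
                continue
            # The send can complete only if the peer is currently at recv(rank).
            if queues[peer][0] == ("recv", rank):
                q.popleft()
                queues[peer].popleft()
                progressed = True
    return all(not q for q in queues)

programs = [alltoall_program(r, 3) for r in range(3)]
print(run_sequential(programs))  # False: every rank is stuck in its first send
```

Under this model all three ranks start with a send, no rank ever reaches a recv, and the function reports a deadlock, which is exactly the situation described in the steps above.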

Please help me clarify!

[^1]: Let's also assume the concurrent implementation across channels as described in GTC '24 S61368.
[^2]: For simplicity, let's not consider the iteration where a GPU sends/receives to and from itself.

osayamenja commented 2 months ago

@AddyLaddy Please let me know your thoughts.

sjeaugey commented 2 months ago

Using ncclGroupStart/ncclGroupEnd is key here. When multiple calls are wrapped inside a group, all of them are fused together logically, which lets you create inter-GPU communication patterns that can happen concurrently.
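
One way to picture "fused together logically" is to treat everything a rank queues inside a group as a set that is posted at once, so issue order no longer matters. The sketch below is a toy model in Python under that assumption; `run_grouped` is a hypothetical helper and says nothing about NCCL's internals.

```python
from collections import Counter

def run_grouped(programs):
    """Assumed model of a fused group: all ops each rank queued are posted together,
    so matching ignores the order in which they were issued.
    Returns True if every send pairs with a recv and vice versa."""
    sends = Counter()
    recvs = Counter()
    for rank, ops in enumerate(programs):
        for kind, peer in ops:
            if kind == "send":
                sends[(rank, peer)] += 1  # keyed by (source, destination)
            else:
                recvs[(peer, rank)] += 1  # keyed by (source, destination)
    return sends == recvs

# Every rank queues all of its sends before any of its recvs,
# an ordering that would stall if each call blocked on its own.
nranks = 3
send_first = [
    [("send", p) for p in range(nranks) if p != r]
    + [("recv", p) for p in range(nranks) if p != r]
    for r in range(nranks)
]
print(run_grouped(send_first))  # True
```

In this model, the only thing that can prevent completion is a send or recv with no counterpart on the peer, not the order of the calls inside the group.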

osayamenja commented 2 months ago

@sjeaugey Thank you! So in that case, are the send/recv calls no longer blocking? Also, could you explain precisely what happens as NCCL executes the fused All-to-All operation?

sjeaugey commented 2 months ago

The fused call of sends and recvs is still blocking. Within that global fused operation, however, the individual sends and receives are not.
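
As a rough illustration of that distinction, here is a toy Python sketch with one thread per rank. It is not how NCCL moves data: unbounded queues stand in for whatever progress mechanism the library actually uses, and `grouped_alltoall` and `rank_main` are hypothetical names. Posting a send returns immediately, while the end of the group blocks until all of that rank's receives have arrived.

```python
import queue
import threading

def grouped_alltoall(nranks, timeout=5.0):
    """Toy all-to-all: non-blocking posts inside the group, blocking wait at its end."""
    # One mailbox per (source, destination) pair.
    mailboxes = {
        (s, d): queue.Queue()
        for s in range(nranks) for d in range(nranks) if s != d
    }
    received = [dict() for _ in range(nranks)]

    def rank_main(rank):
        peers = [p for p in range(nranks) if p != rank]
        # "Inside the group": posting a send does not wait for the peer.
        for peer in peers:
            mailboxes[(rank, peer)].put(f"data {rank}->{peer}")
        # "Group end": block until every receive for this rank has completed.
        for peer in peers:
            received[rank][peer] = mailboxes[(peer, rank)].get(timeout=timeout)

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(nranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received

print(grouped_alltoall(3)[0])  # {1: 'data 1->0', 2: 'data 2->0'}
```

Every rank finishes even though all of them post their sends first, because no individual post waits on a peer; only the group as a whole does.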

osayamenja commented 2 months ago

Thanks again. When you say the fused call is blocking, do you mean that the operation does not complete until all of its send/recv calls complete? If so, how does NCCL keep track of which calls have completed?

NVIDIA / nccl

P2P Blocking Semantics and AlltoAll #1233