NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

What is a non-blocking communicator for? #1346

Open CtfGo opened 4 months ago

CtfGo commented 4 months ago

Hi, all

I found that ncclConfig_t has a blocking option. The official documentation states:

"blocking" can be set to 0 to ask NCCL to never block in any NCCL call

There is also an example showing how this option affects the ncclCommInitRankConfig process.

Here are a few of my questions:

  1. What does non-blocking mean? Does it mean that any NCCL call on the communicator no longer blocks the CPU and no longer performs device synchronization?
  2. Take ncclSend/ncclRecv as examples. The documentation says that both of them block for the GPU. What changes if I create a non-blocking communicator for them?
sjeaugey commented 4 months ago
  1. Yes, that's the idea. Non-blocking applies to the CPU side, so that the application can call ncclCommAbort in case of a deadlock.
  2. No, the GPU-side isn't affected. Only the CPU-side, which in the case of send/recv, may include creating connections with other peers, and could therefore lead to hangs.
CtfGo commented 4 months ago

Thanks for your quick reply, @sjeaugey! I have a few further questions to confirm ^^:

  1. Which NCCL APIs block on the CPU side by default? Are they the APIs that take no cudaStream_t parameter? If not, is there a rule for recognizing them?

2. No, the GPU-side isn't affected. Only the CPU-side, which in the case of send/recv, may include creating connections with other peers, and could therefore lead to hangs.

  2. Regarding "creating connections with other peers": does this happen every time before the NCCL send/recv kernel is launched? And is it a blocking CPU-side operation by default?
sjeaugey commented 4 months ago

Which NCCL APIs block on the CPU side by default? Are they the APIs that take no cudaStream_t parameter?

The non-blocking attribute concerns the CPU side of all NCCL calls. Init/Finalize are, of course, purely CPU-based, but even ncclSend or ncclAllReduce may need to establish connections before the GPU kernel is launched, and may therefore block. Setting the communicator to non-blocking tells NCCL not to block in the CPU call and to return ncclInProgress if it would otherwise block.
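
For readers who want to see what handling ncclInProgress might look like, here is a minimal sketch, not an official recipe. It assumes the communicator was created with config.blocking = 0 (via NCCL_CONFIG_INITIALIZER and ncclCommInitRankConfig) and that the pending state is polled with ncclCommGetAsyncError. The helper name wait_or_abort and the poll budget max_polls are hypothetical. The block under #else is a stand-in (not the real NCCL definitions) so that the control flow can be compiled and exercised without NCCL; build with -DUSE_REAL_NCCL to use the real header.

```c
#include <stddef.h>

#ifdef USE_REAL_NCCL
#include <nccl.h>
#else
/* Stand-ins, NOT the real NCCL definitions: just enough to compile and
 * exercise the control flow on a machine without NCCL. */
typedef enum { ncclSuccess = 0, ncclInProgress = 7 } ncclResult_t;
struct fakeComm { int polls_until_done; int aborted; };
typedef struct fakeComm *ncclComm_t;

/* Fake: reports ncclInProgress for the first polls_until_done polls. */
static ncclResult_t ncclCommGetAsyncError(ncclComm_t comm, ncclResult_t *state) {
  *state = (comm->polls_until_done-- > 0) ? ncclInProgress : ncclSuccess;
  return ncclSuccess;
}

/* Fake: only records that an abort was requested. */
static ncclResult_t ncclCommAbort(ncclComm_t comm) {
  comm->aborted = 1;
  return ncclSuccess;
}
#endif

/* Hypothetical helper: call after an NCCL call on a non-blocking
 * communicator returned ncclInProgress.
 * - Returns ncclSuccess once the pending CPU-side work has completed.
 * - Returns the error code if the communicator reports a failure.
 * - If the state is still ncclInProgress after max_polls polls, aborts
 *   the communicator and returns ncclInProgress. */
static ncclResult_t wait_or_abort(ncclComm_t comm, int max_polls) {
  ncclResult_t state = ncclInProgress;
  for (int i = 0; i < max_polls; ++i) {
    ncclResult_t rc = ncclCommGetAsyncError(comm, &state);
    if (rc != ncclSuccess) return rc;           /* the query itself failed */
    if (state != ncclInProgress) return state;  /* finished or errored */
  }
  ncclCommAbort(comm);  /* presumed stuck: give up instead of hanging */
  return ncclInProgress;
}
```

A real application would more likely bound the wait by wall-clock time than by a poll count, and must not use the communicator again after ncclCommAbort.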

CtfGo commented 4 months ago

I understand now. Thank you very much for your careful explanation!