NVIDIA / nccl

Optimized primitives for collective multi-GPU communication
Other
3.19k stars 804 forks source link

ncclCommSplit in non-blocking API mode #1472

Open kwen2501 opened 1 week ago

kwen2501 commented 1 week ago
ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t* newcomm, ncclConfig_t* config)

Question 1: When calling ncclCommSplit in non-blocking API mode, would the value (i.e. pointer) for newcomm be filled before the API returns?

Question 2: if we want to check if the split finishes, should we call ncclGetAsyncError on the parent comm or on the newcomm? Or either one is fine?

sjeaugey commented 1 week ago

I would think ncclCommSplit mimics the behavior of ncclCommInitRank. So:

Q1: yes, it should Q2: checking the status of newcomm should work, so that you can poll on it then use it for comm operations . But indeed we would also need to wait for the split to complete until we can reuse the parent comm, hence the status of the parent comm should be ncclInProgress as well until it's safe to call new operations on the parent comm.

I might be missing something though; let me check internally.

kwen2501 commented 1 week ago

Thanks @sjeaugey

Re Q1: Yeah, I had the same mental model / expectation. But I also saw some lines as below from PyTorch's log after using split in non-blocking mode:

[PG ID 1 PG GUID 1(default_pg:split:0) Rank 0] ProcessGroupNCCL created ncclComm_ 0 on CUDA device: 0

The "ncclComm_ 0" part seems to suggest that the value is not filled when ncclCommSplit returns. It may be also due to bugs in PyTorch, I can isolate further.

Re Q2: Good point about the parent comm.

sjeaugey commented 1 week ago

Ok. It looks like the current implementation is focusing on the parent comm and considering ncclCommSplit like a ncclAllReduce in that respect: the operation is not complete while the parent comm status is ncclInProgress and when it becomes ncclSuccess, then the newcomm is filled.

We'll see if we can improve the behavior to continue being consistent with the non-blocking semantics of the parent, but also be more consistent with the ncclCommInitRankConfig semantics.

I'm thinking about making the ncclCommSplit non-blocking only if both the parent comm and the new comm are non-blocking. Also, in non-blocking mode, the newcomm should be valid when we return, like for ncclCommInitRank. That seems to me to follow the constraints of both sides.

Of course we'd need to make the changes to confirm I'm not missing anything.

kwen2501 commented 1 week ago

Sounds good, thanks!

Also, in non-blocking mode, the newcomm should be valid when we return, like for ncclCommInitRank.

nit: in blocking mode.

sjeaugey commented 1 week ago

nit: in blocking mode.

Mmm, no, I meant in non-blocking more. In blocking mode when we return from ncclCommSplit, the new comm should be valid and fully initialized. In non-blocking mode, the new comm we return should be a valid object (as far as I understand, not the case currently), although probably not initialized yet (ncclInProgress).

Does that make sense?

kwen2501 commented 1 week ago

Oh, my bad understanding of "valid", I took it as "initialized". All good!