NVIDIA / nccl

Optimized primitives for collective multi-GPU communication
Other
3.26k stars 827 forks source link

Fix comm ready check in ncclIbIsend and ncclIbIrecv #1489

Open WWeiOne opened 1 month ago

WWeiOne commented 1 month ago

ncclIbIsend & ncclIbIrecv contains redundant checks for comm->base.ready. These checks were made redundant due to a change in commit 5d3ab08, which removed a call to ncclSendCheck that might alter the readiness state of the communicator. The second check for comm->base.ready now will never executes.

Commit 5d3ab08 Feb 27 2023

  if (comm->ready == 0) NCCLCHECK(ncclSendCheck(comm));
  if (comm->ready == 0) { *request = NULL; return ncclSuccess; }

Also, the *request = NULL; should be kept, as none request is successfully added into the fifo slot, it's safer to assign NULL and consistent with the check if (sub->requests[buffSlot] != NULL) in sendProxyProgress.

kiskra-nvidia commented 1 month ago

Thank you for your submission! We'll consider it for a future release (it missed the 2.24 window).