NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Is it safe to start p2p send/recv on a communicator while another communicator is being initialized in another thread? #1082

SpiritedAwayCN commented 9 months ago

My requirement is that the following two tasks can be done asynchronously in a single process:

  • initializing a new communicator with ncclCommInitRank
  • launching batched p2p send/recv on an already-initialized communicator

Since both tasks incur considerable overhead, I want them to execute asynchronously (or even simultaneously). Is it possible to assign them to different threads? Thanks!
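[Editor's note: for concreteness, a minimal sketch of the pattern being asked about, assuming one process that already holds a working communicator and wants to build a second one in a helper thread. All names (data_comm, init_thread, etc.) are illustrative, not from the issue, and error handling is omitted.]

```c
#include <pthread.h>
#include <cuda_runtime.h>
#include <nccl.h>

typedef struct {
    ncclComm_t  *comm;   /* communicator to create                */
    ncclUniqueId id;     /* unique id distributed out of band     */
    int nranks, rank;
} init_args_t;

/* Task 1, run on a helper thread: initialize the new communicator. */
static void *init_thread(void *p) {
    init_args_t *a = (init_args_t *)p;
    ncclCommInitRank(a->comm, a->nranks, a->id, a->rank);
    return NULL;
}

/* Task 2, run on the main thread: batched p2p on the existing communicator. */
static void batched_p2p(ncclComm_t data_comm, float *sendbuf, float *recvbuf,
                        size_t count, int peer, cudaStream_t stream) {
    ncclGroupStart();
    ncclSend(sendbuf, count, ncclFloat, peer, data_comm, stream);
    ncclRecv(recvbuf, count, ncclFloat, peer, data_comm, stream);
    ncclGroupEnd();
    /* The question: can this launch proceed while init_thread is still
     * inside ncclCommInitRank on the other thread? */
}
```

In the intended flow, pthread_create would start init_thread while the main thread calls batched_p2p on the existing communicator.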

sjeaugey commented 9 months ago

That's a good question. In general I'd think it should work, but there may be CUDA calls during ncclCommInitRank which could cause an implicit inter-device synchronization. If that is the case, then you could end up with a deadlock if:

  • the p2p communication launches on GPU A but not on GPU B
  • the init is blocking the launch on GPU B, waiting for GPU A to complete its CUDA work, including the NCCL operation which is stuck.
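[Editor's note: to make the two conditions above concrete, here is a sketch of a single process driving two GPUs (A = device 0, B = device 1, one NCCL rank per device) through one grouped p2p call. Device numbering and names are illustrative.]

```c
#include <cuda_runtime.h>
#include <nccl.h>

void grouped_p2p(ncclComm_t comms[2], float *buf[2], size_t count,
                 cudaStream_t streams[2]) {
    /* One group containing work for BOTH devices: rank 0 sends to rank 1. */
    ncclGroupStart();
    cudaSetDevice(0);   /* GPU A: send side */
    ncclSend(buf[0], count, ncclFloat, /*peer=*/1, comms[0], streams[0]);
    cudaSetDevice(1);   /* GPU B: receive side */
    ncclRecv(buf[1], count, ncclFloat, /*peer=*/0, comms[1], streams[1]);
    ncclGroupEnd();     /* launches a kernel on each device */
    /* Condition 1: A's kernel launches and spins waiting for B's.
     * Condition 2: an implicit sync inside a concurrent ncclCommInitRank
     * holds back B's launch until A's queued work drains.
     * Together, both GPUs wait on each other: a deadlock. */
}
```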

SpiritedAwayCN commented 9 months ago

Thank you for your reply! Unfortunately, my batched p2p communication is very complicated, so deadlock case 1 usually occurs in practice (during the call to ncclCommInitRank). I'm a bit confused about why initialization would cause inter-device synchronization; shouldn't it just set network-related parameters? And are there alternatives that would achieve what I need?

sjeaugey commented 9 months ago

I'm a bit confused about why initialization would cause inter-device synchronization

In theory it should not, and in NCCL 2.19 we replaced many CUDA calls with cuMem* calls, so the situation should improve. But we might still have some calls that cause syncs, in particular when we share buffers between CUDA devices and map them onto remote GPUs.
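[Editor's note: as a practical aside, the NCCL version in use can be checked at runtime to see whether the 2.19 changes mentioned above apply. A minimal sketch; the version encoding shown assumes NCCL >= 2.9.]

```c
#include <stdio.h>
#include <nccl.h>

int main(void) {
    int v = 0;
    ncclGetVersion(&v);  /* since 2.9: major*10000 + minor*100 + patch */
    printf("NCCL %d.%d.%d\n", v / 10000, (v / 100) % 100, v % 100);
    return 0;
}
```

Running with NCCL_DEBUG=INFO in the environment also logs the version and transport setup at init time.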