NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Future work on "Inter-GPU Communication with CUDA-aware MPI" #493

Open cadedaniel opened 3 years ago

cadedaniel commented 3 years ago

Hi NCCL team, thanks for your work on this great library.

This issue is about a section in the documentation, "Inter-GPU Communication with CUDA-aware MPI".

My summary is that because NCCL uses blocking kernels that create dependencies between devices, any other framework that concurrently creates its own dependencies between devices risks deadlock.

Is there any future work planned to resolve this incompatibility? My workloads use both NCCL and OpenMPI; I've seen nontrivial performance improvement on tests with CUDA-aware MPI.

If there is no such future work, do you have any recommendations on workarounds? If OpenMPI had functionality to restrict its operations that enforce cross-device dependencies to some epoch window, could NCCL work safely as long as it operated in a separate epoch window?

sjeaugey commented 3 years ago

Indeed, separating communication in different epochs is the best way to eliminate any chance of deadlock.

From the documentation:

Using NCCL to perform inter-GPU communication concurrently with CUDA-aware MPI may create deadlocks.

The important part here is "concurrently". As long as CUDA-aware MPI communication and NCCL communication don't happen in parallel, there is no problem.
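As a rough illustration of that separation (not from the thread), here is a minimal sketch assuming one GPU per MPI rank, with illustrative names (`nccl_comm`, `stream`, `sendbuf`, `recvbuf`, `count`) and error checking omitted. The NCCL phase is fully drained with a stream synchronization before any CUDA-aware MPI communication is issued, so the two libraries never hold in-flight inter-GPU dependencies at the same time:

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

/* Sketch: serialize NCCL and CUDA-aware MPI communication into
 * non-overlapping phases to avoid the deadlock described above. */
void exchange(float *sendbuf, float *recvbuf, size_t count,
              ncclComm_t nccl_comm, cudaStream_t stream, MPI_Comm comm)
{
    /* Phase 1: NCCL communication, enqueued on `stream`. */
    ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum,
                  nccl_comm, stream);

    /* Wait for all NCCL kernels to finish before any MPI call, so NCCL's
     * inter-GPU dependencies are resolved before MPI creates its own. */
    cudaStreamSynchronize(stream);

    /* Phase 2: CUDA-aware MPI communication on device buffers. */
    MPI_Allreduce(MPI_IN_PLACE, recvbuf, (int)count, MPI_FLOAT, MPI_SUM, comm);

    /* Any later NCCL work that depends on the MPI result can be enqueued
     * on `stream` here, since MPI_Allreduce has already returned. */
}
```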