NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Future work on "Inter-GPU Communication with CUDA-aware MPI" #493

Open cadedaniel opened 3 years ago

cadedaniel commented 3 years ago

Hi NCCL team, thanks for your work on this great library.

This issue is about a section in the documentation, "Inter-GPU Communication with CUDA-aware MPI".

My summary is that because NCCL uses blocking kernels that create dependencies between devices, any other framework that concurrently creates its own dependencies between devices risks deadlock.

Is there any future work planned to resolve this incompatibility? My workloads use both NCCL and OpenMPI; I've seen nontrivial performance improvement on tests with CUDA-aware MPI.

If there is no such future work, do you have any recommendations on workarounds? If OpenMPI had functionality to restrict its operations that enforce cross-device dependencies to some epoch window, could NCCL work safely as long as it operated in a separate epoch window?

sjeaugey commented 3 years ago

Indeed, separating communication in different epochs is the best way to eliminate any chance of deadlock.

From the documentation:

Using NCCL to perform inter-GPU communication concurrently with CUDA-aware MPI may create deadlocks.

The important part here is "concurrently". As long as CUDA-aware MPI communication and NCCL communication don't happen in parallel, there is no problem.
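As a rough illustration of that separation (not from the thread), here is a minimal sketch assuming one GPU per MPI rank, with illustrative names (`nccl_comm`, `stream`, `sendbuf`, `recvbuf`, `count`) and error checking omitted. The NCCL phase is fully drained with a stream synchronization before any CUDA-aware MPI communication is issued, so the two libraries never hold in-flight inter-GPU dependencies at the same time:

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

/* Sketch: serialize NCCL and CUDA-aware MPI communication into
 * non-overlapping phases to avoid the deadlock described above. */
void exchange(float *sendbuf, float *recvbuf, size_t count,
              ncclComm_t nccl_comm, cudaStream_t stream, MPI_Comm comm)
{
    /* Phase 1: NCCL communication, enqueued on `stream`. */
    ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum,
                  nccl_comm, stream);

    /* Wait for all NCCL kernels to finish before any MPI call, so NCCL's
     * inter-GPU dependencies are resolved before MPI creates its own. */
    cudaStreamSynchronize(stream);

    /* Phase 2: CUDA-aware MPI communication on device buffers. */
    MPI_Allreduce(MPI_IN_PLACE, recvbuf, (int)count, MPI_FLOAT, MPI_SUM, comm);

    /* Any later NCCL work that depends on the MPI result can be enqueued
     * on `stream` here, since MPI_Allreduce has already returned. */
}
```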