YanjieGao opened 1 month ago
Is the design of `send` related to MPI's buffered-mode send operation? That would be inconsistent with the NCCL documentation, which describes synchronizing semantics for send.
https://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node53.htm
I have been running some communication-synchronization tests recently and found that `send` in PyTorch with the NCCL backend behaves as non-blocking: the send CUDA kernel does not appear to block execution and wait for the peer's receive operation, while the recv kernel does block. This is inconsistent with the descriptions in the PyTorch and NCCL documentation. What is the reason?
Reproduction example (I am trying to create a test case that triggers a communication deadlock: the Gloo backend hangs at the send call, while the NCCL backend hangs at the recv call):
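A minimal sketch of the kind of head-to-head test I am running (not my exact script; it assumes two ranks launched e.g. with `torchrun --nproc_per_node=2`, and the `BACKEND` environment variable is just a convenience here for switching between `gloo` and `nccl`):

```python
# Head-to-head send/recv test: both ranks post send first, then recv.
# Under synchronous (rendezvous) send semantics, neither send can
# complete until the peer posts its recv, so both ranks hang in send().
# If send behaves like a buffered/non-blocking operation, both sends
# return and the recvs match, so the program completes.
import os
import torch
import torch.distributed as dist

def main(backend="nccl"):
    dist.init_process_group(backend)
    rank = dist.get_rank()
    peer = 1 - rank  # assumes exactly two ranks

    if backend == "nccl":
        torch.cuda.set_device(rank)
        device = torch.device("cuda", rank)
    else:
        device = torch.device("cpu")

    # Use a reasonably large buffer so the transfer is not trivially
    # absorbed by internal buffering.
    send_buf = torch.full((1 << 20,), float(rank), device=device)
    recv_buf = torch.empty(1 << 20, device=device)

    print(f"[rank {rank}] calling send", flush=True)
    dist.send(send_buf, dst=peer)
    print(f"[rank {rank}] send returned, calling recv", flush=True)
    dist.recv(recv_buf, src=peer)
    print(f"[rank {rank}] recv returned", flush=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main(os.environ.get("BACKEND", "nccl"))
```

With Gloo, I observe neither rank printing "send returned" (hang in send); with NCCL, both ranks print "send returned" and the hang, when I force a mismatch, shows up at the recv call instead.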