Open pritamdamania87 opened 3 years ago
What NCCL version are you using? Sending to the same peer multiple times was not supported in NCCL 2.7. It should work with NCCL 2.8 onwards.
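To check which side of that boundary a build falls on, here is a minimal sketch that decodes the integer version code reported by ncclGetVersion (or NCCL_VERSION_CODE). The helper names are hypothetical. The encoding is an assumption based on my reading of the NCCL_VERSION macro in nccl.h: major*1000 + minor*100 + patch up to 2.8.x, and major*10000 + minor*100 + patch from 2.9 onwards. The 2.8 threshold is the one stated in this thread.

```python
from typing import Tuple


def decode_nccl_version(code: int) -> Tuple[int, int, int]:
    """Decode an NCCL integer version code into (major, minor, patch).

    Assumed encoding (from the NCCL_VERSION macro in nccl.h):
      up to 2.8.x:  major * 1000  + minor * 100 + patch   (e.g. 2703)
      2.9 onwards:  major * 10000 + minor * 100 + patch   (e.g. 20906)
    """
    if code < 10000:
        major, rest = divmod(code, 1000)
    else:
        major, rest = divmod(code, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def supports_repeated_send_recv(code: int) -> bool:
    """Hypothetical helper: True if the version is 2.8 or newer, the point
    from which this thread says repeated send/recv to one peer works."""
    return decode_nccl_version(code) >= (2, 8, 0)


if __name__ == "__main__":
    for code in (2703, 2803, 20906):
        print(code, decode_nccl_version(code), supports_repeated_send_recv(code))
```

In PyTorch, torch.cuda.nccl.version() should report the NCCL version the build was compiled against, which may be a quicker check than decoding the raw code.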
We're actually using 2.7 (NCCL version 2.7.3); will try out 2.8. Btw, what is the latest stable release of NCCL? Is the latest release on this page always stable: https://github.com/NVIDIA/nccl/releases (currently 2.9.6-1)?
This issue does indeed seem to go away on NCCL 2.8.3, at least.
I was debugging the following PyTorch issue regarding NCCL send/recv: https://github.com/pytorch/pytorch/issues/50092. I tried to reproduce the issue in NCCL itself to isolate whether this is an NCCL issue or a PyTorch implementation issue. I have shared my code, which uses NCCL send/recv, below.
The interesting part is that when I remove the second ncclRecv and its associated verification, this works fine, but two ncclRecv calls don't seem to work. Not sure if there is a bug in my code causing this. The error I see on rank 1 is the following (looks like recvbuff is all zeros):