YanjieGao opened this issue 2 months ago
Indeed, there is no mention in the documentation that sizes have to match between ranks. That's obvious for anyone who's used MPI, but it's an oversight.
It is stated clearly for point-to-point communication:
Any point-to-point communication needs two NCCL calls : a call to ncclSend() on one rank and a corresponding ncclRecv() on the other rank, with the same count and data type.
And there is a hint at that rule in the MPI section:
MPI allows for different send and receive counts and types, as long as sendcount*sizeof(sendtype) == recvcount*sizeof(recvtype). NCCL does not allow that, defining a single count and a single data-type.
But indeed, it's not clearly mentioned. We'll fix that in the next documentation update.
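To make the quoted rules concrete, here is a minimal sketch of a matched point-to-point exchange through torch.distributed with the NCCL backend (the buffer size, device handling, and launch via torchrun are illustrative assumptions, not taken from this thread):

```python
# Minimal sketch (illustrative, not from the original thread): the rule quoted
# above, expressed through torch.distributed with the NCCL backend.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # torchrun provides RANK/WORLD_SIZE
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    count = 1024  # must be identical on both sides of the send/recv pair
    if rank == 0:
        buf = torch.ones(count, device="cuda")
        dist.send(buf, dst=1)   # with the NCCL backend this is carried out by ncclSend
    elif rank == 1:
        buf = torch.empty(count, device="cuda")
        dist.recv(buf, src=0)   # ... and this by ncclRecv, with the same count and dtype
    # Using a different count on either side is undefined: the call may hang
    # or silently corrupt data.

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```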
Hi sjeaugey, thanks for your reply. Could you also explain the rules for collective communication such as allreduce: with which shape mismatches will it hang, and with which will it not? Also, for the illegal cases in my test, when a shape mismatch is allowed to complete without hanging, the tail of the computed tensor is incorrect. Shouldn't NCCL assert or hang instead, so that users don't silently get a wrong numerical result?
Could you also explain the rules for collective communication such as allreduce: with which shape mismatches will it hang, and with which will it not?
It may hang as soon as the sizes differ. In some cases we will end up with the same number of network transfers and it may appear to work, but some of the data will be incorrect, since not every rank contributed an input value at that offset.
Also, for the illegal cases in my test, when a shape mismatch is allowed to complete without hanging, the tail of the computed tensor is incorrect. Shouldn't NCCL assert or hang instead, so that users don't silently get a wrong numerical result?
Doing so may require additional communication between ranks, which would impact performance negatively. When we detect something wrong we print it, but we can't always detect it without an extra cost.
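At the application level it is possible to pay that extra cost explicitly, for example only in debug runs. A minimal sketch, assuming a torch.distributed program with the NCCL backend (the helper name `checked_all_reduce` is hypothetical, not a NCCL or PyTorch API):

```python
# Minimal sketch of a user-side guard (an application-level assumption, not a
# NCCL feature): exchange element counts before the collective to catch
# mismatches, at the cost of the extra communication described above.
import torch
import torch.distributed as dist

def checked_all_reduce(tensor: torch.Tensor, op=dist.ReduceOp.SUM) -> None:
    world_size = dist.get_world_size()
    # Gather every rank's element count; keep this out of performance-critical paths.
    local_count = torch.tensor([tensor.numel()], device=tensor.device, dtype=torch.int64)
    all_counts = [torch.empty_like(local_count) for _ in range(world_size)]
    dist.all_gather(all_counts, local_count)
    counts = [int(c.item()) for c in all_counts]
    if len(set(counts)) != 1:
        raise RuntimeError(f"all_reduce size mismatch across ranks: {counts}")
    dist.all_reduce(tensor, op=op)
```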
Hi sjeaugey, thank you for your feedback. It would be very helpful for users if these assumptions could be stated in the documentation.
Absolutely. I fixed the doc, so it should be updated in the next release.
Hi, I have a similar problem to the one described in https://github.com/NVIDIA/nccl/issues/1394.
For allreduce with mismatched shapes across ranks (PyTorch with the NCCL backend), some shape combinations cause the process to hang while others do not; the synchronization semantics are not specified anywhere, and the documentation does not mention the requirement. When the shapes mismatch but the call still completes, the results can be incorrect: for example, when rank 0 has shape 409600 and all other ranks have shape 409598, the last two values on rank 0 come out as 6.0 6.0 in test case 3.
Experimental results:
Test 1: rank 0 shape is 409600, rank != 0 shape is 32768000; the run hangs.
Test 2: rank 0 shape is 409600, rank != 0 shape is 409600; the run finishes normally.
Test 3: rank 0 shape is 409600, rank != 0 shape is 409598; the run finishes, with incorrect values at the tail of rank 0's result.
Test 4: rank 0 shape is 409600, rank != 0 shape is 32768; the run finishes.
Test code:
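The original test code is not reproduced here; the following is a minimal sketch of a script matching the setup described in the experimental results, assuming a torchrun launch (the argument names, default sizes, and file name below are illustrative):

```python
# Illustrative sketch only (not the original test code): rank 0 uses one
# element count and every other rank uses another, then all ranks call
# all_reduce on the mismatched tensors.
import argparse
import os
import torch
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--rank0-size", type=int, default=409600)
    parser.add_argument("--other-size", type=int, default=409598)
    args = parser.parse_args()

    dist.init_process_group(backend="nccl")  # torchrun sets RANK/WORLD_SIZE
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Deliberate shape mismatch: undefined behaviour in NCCL, the call may
    # hang or finish with incorrect values in the tail of rank 0's tensor.
    size = args.rank0_size if rank == 0 else args.other_size
    x = torch.ones(size, device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    print(f"rank {rank}: last two values {x[-2:].tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched, for example, as `torchrun --nproc_per_node=2 allreduce_mismatch_test.py --rank0-size 409600 --other-size 409598` to mimic test case 3 (the world size used in the original tests is not stated above).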