Closed BichengYing closed 4 years ago
It should not be related to the Open MPI CUDA implementation: when we added sleep(0.001) after casting the tensor to a CUDA tensor, the error no longer occurred. Very likely this issue happens because the GPU tensor allocation is not synchronized between the Torch, Open MPI, and Bluefog threads. It might also be related to a caching issue.
The ready_event seems to be very important here. Originally, MPI and PyTorch do not communicate with each other, so MPI cannot tell whether the CUDA memory has been prepared by PyTorch yet, even with non_blocking=False. After adding the ready_event (inserting an artificial CUDA event into the CUDA stream), the synchronization between MPI and PyTorch is complete.
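To illustrate the handshake, here is a minimal stdlib-only analogy (not Bluefog's real API; the function names and the use of threading.Event instead of a CUDA event are illustrative). The framework thread signals the event only after the buffer is actually populated, and the MPI-side thread blocks on that event before reading, rather than assuming the memory is valid as soon as the pointer is handed over:

```python
import threading
import time

def framework_thread(buffer, ready_event):
    # Stands in for the asynchronous CUDA cast/copy issued by PyTorch.
    time.sleep(0.01)
    buffer.append(42)      # memory is now actually populated
    ready_event.set()      # analogous to recording the ready_event on the stream

def mpi_thread(buffer, ready_event, out):
    ready_event.wait()     # analogous to MPI waiting on the CUDA event
    out.append(buffer[0])  # safe: the producer has finished writing

buffer, out = [], []
ready = threading.Event()
producer = threading.Thread(target=framework_thread, args=(buffer, ready))
consumer = threading.Thread(target=mpi_thread, args=(buffer, ready, out))
consumer.start(); producer.start()
producer.join(); consumer.join()
print(out)  # [42]
```

Without the wait on the event, the consumer could read the buffer before the producer finishes, which is exactly the race the sleep(0.001) workaround was masking.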
When testing with pair_gossip, there are random failures when testing with DoubleTensor. Currently, our unit tests don't cover torch.cuda.DoubleTensor well yet. Need some input on this.
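One way to close the coverage gap is to parametrize the existing test over dtypes so the float64 path is exercised alongside the others. A stdlib-only sketch of the pattern (pair_gossip_reference is a hypothetical stand-in for the real collective; in the actual test the tensors would be created with the corresponding torch.cuda dtype):

```python
# Hypothetical sketch: run the same check once per CUDA tensor dtype,
# including DoubleTensor, instead of testing FloatTensor only.
DTYPES = ["FloatTensor", "DoubleTensor", "IntTensor", "LongTensor"]

def pair_gossip_reference(values):
    # Placeholder for the real pair_gossip result: average of the two peers.
    return sum(values) / len(values)

def run_pair_gossip_dtype_checks():
    results = {}
    for dtype in DTYPES:
        # In the real test: create tensors of this torch.cuda dtype,
        # call pair_gossip, and compare against the reference value.
        results[dtype] = pair_gossip_reference([1.0, 3.0])
    return results

print(run_pair_gossip_dtype_checks()["DoubleTensor"])  # 2.0
```

Since the failures are random, the real version would likely also need to repeat each dtype case several times to make the flakiness reproducible.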