Closed BichengYing closed 4 years ago
It should not be related to the Open MPI CUDA implementation: when we added sleep(0.001) after casting the tensor to a CUDA tensor, the error no longer occurred. Very likely this issue happens because the GPU tensor allocation is not synchronized between the Torch, Open MPI, and Bluefog threads. It might also be related to a caching issue.
The ready_event seems to be very important here. Originally, MPI and PyTorch do not communicate with each other, so MPI cannot tell whether the CUDA memory has been prepared by PyTorch yet, even with non_blocking=False. After adding the ready_event (inserting an artificial CUDA event into the CUDA stream), the synchronization between MPI and PyTorch is complete.
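To illustrate the handshake, here is a minimal stdlib-only analogy (not Bluefog's real API; the function names and the use of threading.Event instead of a CUDA event are illustrative). The framework thread signals the event only after the buffer is actually populated, and the MPI-side thread blocks on that event before reading, rather than assuming the memory is valid as soon as the pointer is handed over:

```python
import threading
import time

def framework_thread(buffer, ready_event):
    # Stands in for the asynchronous CUDA cast/copy issued by PyTorch.
    time.sleep(0.01)
    buffer.append(42)      # memory is now actually populated
    ready_event.set()      # analogous to recording the ready_event on the stream

def mpi_thread(buffer, ready_event, out):
    ready_event.wait()     # analogous to MPI waiting on the CUDA event
    out.append(buffer[0])  # safe: the producer has finished writing

buffer, out = [], []
ready = threading.Event()
producer = threading.Thread(target=framework_thread, args=(buffer, ready))
consumer = threading.Thread(target=mpi_thread, args=(buffer, ready, out))
consumer.start(); producer.start()
producer.join(); consumer.join()
print(out)  # [42]
```

Without the wait on the event, the consumer could read the buffer before the producer finishes, which is exactly the race the sleep(0.001) workaround was masking.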
When testing with pair_gossip, there are random failures when testing with DoubleTensor. Currently, our unit tests don't cover torch.cuda.DoubleTensor well yet. Need some input on this.
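One way to close the coverage gap is to parametrize the existing test over dtypes so the float64 path is exercised alongside the others. A stdlib-only sketch of the pattern (pair_gossip_reference is a hypothetical stand-in for the real collective; in the actual test the tensors would be created with the corresponding torch.cuda dtype):

```python
# Hypothetical sketch: run the same check once per CUDA tensor dtype,
# including DoubleTensor, instead of testing FloatTensor only.
DTYPES = ["FloatTensor", "DoubleTensor", "IntTensor", "LongTensor"]

def pair_gossip_reference(values):
    # Placeholder for the real pair_gossip result: average of the two peers.
    return sum(values) / len(values)

def run_pair_gossip_dtype_checks():
    results = {}
    for dtype in DTYPES:
        # In the real test: create tensors of this torch.cuda dtype,
        # call pair_gossip, and compare against the reference value.
        results[dtype] = pair_gossip_reference([1.0, 3.0])
    return results

print(run_pair_gossip_dtype_checks()["DoubleTensor"])  # 2.0
```

Since the failures are random, the real version would likely also need to repeat each dtype case several times to make the flakiness reproducible.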