Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0

Mysterious behavior of requiring setting CUDA device explicitly in OpenMPI 1.10.7 #15

Closed hanbinhu closed 4 years ago

hanbinhu commented 4 years ago

To run a CUDA-aware MPI_Allreduce in OpenMPI 1.10.7, we must explicitly set the CUDA device with the cudaSetDevice function from cuda_runtime.h, regardless of how many GPUs are available on the node.

For example, even when there is only one GPU, the following code is required:

    cudaGetDevice(&device);  // device is expected to be 0 when there is only one GPU
    cudaSetDevice(device);
    MPI_Allreduce(...);      // Without the previous line, this reports a cuMemcpy error.
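A slightly fuller sketch of the workaround is shown here for reference. It is not taken from the bluefog source. The helpers `pick_device` and `parse_local_rank` are hypothetical, and mapping the node-local rank onto a device is an assumption added for the multi-GPU case. The sketch also assumes the launcher exports `OMPI_COMM_WORLD_LOCAL_RANK`, as Open MPI's `mpirun` normally does. The MPI/CUDA part is only compiled when `WITH_MPI_CUDA` is defined, so the helpers can be built and checked without either library.

```c
/* Sketch only: select a CUDA device explicitly before a CUDA-aware MPI call.
 * Build with -DWITH_MPI_CUDA (plus MPI and the CUDA runtime) for the full program. */
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: map a node-local rank onto one of device_count GPUs.
 * Returns -1 when the inputs cannot name a valid device. */
int pick_device(int local_rank, int device_count) {
    if (local_rank < 0 || device_count <= 0) return -1;
    return local_rank % device_count;
}

/* Hypothetical helper: parse the node-local rank from an environment string.
 * Falls back to 0 when the value is unset, malformed, or out of range. */
int parse_local_rank(const char *value) {
    if (value == NULL || *value == '\0') return 0;
    char *end = NULL;
    long parsed = strtol(value, &end, 10);
    if (*end != '\0' || parsed < 0 || parsed > INT_MAX) return 0;
    return (int)parsed;
}

#ifdef WITH_MPI_CUDA
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0) {
        fprintf(stderr, "no CUDA device visible\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Assumption: Open MPI's launcher exports OMPI_COMM_WORLD_LOCAL_RANK. */
    int local_rank = parse_local_rank(getenv("OMPI_COMM_WORLD_LOCAL_RANK"));
    int device = pick_device(local_rank, device_count);

    /* The explicit call that this issue reports as required on OpenMPI 1.10.7. */
    cudaSetDevice(device);

    /* One float in device memory, summed in place across all ranks. */
    float *buf = NULL;
    cudaMalloc((void **)&buf, sizeof(float));
    cudaMemset(buf, 0, sizeof(float));
    MPI_Allreduce(MPI_IN_PLACE, buf, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
#endif
```

With a single GPU, `pick_device` always returns 0, which matches the `cudaGetDevice`/`cudaSetDevice` pair above. The only point of the sketch is that the device is set explicitly before `MPI_Allreduce` touches a device buffer.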

BichengYing commented 4 years ago

The cause is still unknown, but the problem no longer occurs after switching to Open MPI 4.0. Should we close this?

BichengYing commented 4 years ago

Never mind. We are no longer using Open MPI versions below 4.0.