To run a CUDA-aware MPI_Allreduce with OpenMPI 1.10.7, you must explicitly set the CUDA device with cudaSetDevice (declared in cuda_runtime.h), no matter how many GPUs are available on the node. For example, when there is only one GPU:
int device;
cudaGetDevice(&device);  // returns 0 when there is only one GPU
cudaSetDevice(device);   // without this call, the next line fails with a cuMemcpy error
MPI_Allreduce(...);
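To make the pattern concrete, here is a minimal sketch of a complete program that binds each rank to its device before calling MPI_Allreduce on device buffers. It assumes a CUDA-aware build of OpenMPI and one visible GPU per rank; buffer sizes and values are illustrative.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Explicitly bind this process to its current CUDA device.
       Without this, MPI_Allreduce on device pointers can fail
       with a cuMemcpy error under OpenMPI 1.10.7. */
    int device = 0;
    cudaGetDevice(&device);  /* 0 when there is only one GPU */
    cudaSetDevice(device);

    const int n = 4;
    double host[4] = {1.0, 2.0, 3.0, 4.0};
    double *sendbuf, *recvbuf;
    cudaMalloc((void **)&sendbuf, n * sizeof(double));
    cudaMalloc((void **)&recvbuf, n * sizeof(double));
    cudaMemcpy(sendbuf, host, n * sizeof(double), cudaMemcpyHostToDevice);

    /* A CUDA-aware MPI accepts device pointers directly. */
    MPI_Allreduce(sendbuf, recvbuf, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    cudaMemcpy(host, recvbuf, n * sizeof(double), cudaMemcpyDeviceToHost);
    if (rank == 0)
        printf("recv[0] = %f\n", host[0]);

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and link against the CUDA runtime (for example, `mpicc allreduce.c -lcudart`), then launch with mpirun; the program requires GPU hardware and a CUDA-aware MPI build to run.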