This pull request addresses an issue where several MPI processes creating CUDA devices simultaneously could cause CUDA functions to fail. When run under MPI, the CudaDevice constructor is called in a loop synchronized by MPI_Barrier, and each process creates its device only when the loop counter equals its global rank, so only one process creates a device at a time.
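For reviewers, the rank-ordered loop can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `create_in_rank_order` is a hypothetical helper, and the barrier and device creation are passed in as callables so the pattern can be read (and exercised) without an MPI or CUDA installation.

```cpp
#include <cassert>

// Hypothetical helper (not part of this PR): runs `create` on exactly one
// rank per loop iteration, with every rank meeting at `barrier` each turn.
// `rank` and `size` are assumed to be the global rank and communicator size.
template <typename Barrier, typename Create>
void create_in_rank_order(int rank, int size, Barrier barrier, Create create) {
    assert(rank >= 0 && rank < size);
    for (int turn = 0; turn < size; ++turn) {
        if (turn == rank) {
            create();   // only this rank constructs its device on this turn
        }
        barrier();      // all ranks wait before the next rank proceeds
    }
}
```

Under MPI, `barrier` would presumably wrap `MPI_Barrier(MPI_COMM_WORLD)` and `create` would invoke the CudaDevice constructor, with `rank` and `size` obtained from `MPI_Comm_rank` and `MPI_Comm_size`. Every rank must execute the full loop, since skipping a barrier call on any rank would deadlock the others.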