asolano closed this issue 3 years ago
While it will work automatically in SLURM, it will also work for MPI as long as the corresponding environment variables are set: LOCAL_RANK, etc., per PyTorch standards. Any distributed training script that sets those variables properly for PyTorch should work. Open to PRs to get it working automatically with MPI variables, though.
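For anyone who wants a workaround without patching the library: a minimal sketch of a launcher-side shim that copies Open MPI's standard `OMPI_COMM_WORLD_*` variables into the `RANK`/`LOCAL_RANK`/`WORLD_SIZE` variables PyTorch expects. The helper name `mpi_env_to_torch_env` is hypothetical, not part of the codebase:

```python
import os

def mpi_env_to_torch_env(env):
    """Derive PyTorch-standard distributed variables from Open MPI's.

    Returns an empty dict when the process was not launched via mpirun,
    so applying it is a no-op outside MPI jobs.
    """
    if env.get("OMPI_COMM_WORLD_SIZE") is None:
        return {}  # not running under Open MPI's mpirun
    return {
        "RANK": env["OMPI_COMM_WORLD_RANK"],
        "LOCAL_RANK": env["OMPI_COMM_WORLD_LOCAL_RANK"],
        "WORLD_SIZE": env["OMPI_COMM_WORLD_SIZE"],
    }

# Apply before any torch.distributed initialization in the training script.
os.environ.update(mpi_env_to_torch_env(os.environ))
```

This keeps the training script itself unchanged; only the entry point needs to run the shim first.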
I don't have access to a cluster where `mpirun` works correctly, so I can't check this, but the mapping for MPI environment variables should be pretty straightforward.
Adding

```python
elif os.environ.get("OMPI_COMM_WORLD_SIZE", None) is not None:
    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    world_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
```

to `ddp_utils.py` should work.
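For context, a self-contained sketch of how the combined detection could look with that branch added. This is an assumption about the structure, not the actual contents of `ddp_utils.py`; the variable names are the standard SLURM (`SLURM_*`) and Open MPI (`OMPI_COMM_WORLD_*`) ones:

```python
import os

def get_distrib_info(env=os.environ):
    """Return (local_rank, world_rank, world_size) for the current process."""
    if env.get("SLURM_JOBID") is not None:
        # Launched via srun: SLURM exports these for every task.
        local_rank = int(env["SLURM_LOCALID"])
        world_rank = int(env["SLURM_PROCID"])
        world_size = int(env["SLURM_NTASKS"])
    elif env.get("OMPI_COMM_WORLD_SIZE", None) is not None:
        # Launched via Open MPI's mpirun (the branch suggested above).
        local_rank = int(env["OMPI_COMM_WORLD_LOCAL_RANK"])
        world_rank = int(env["OMPI_COMM_WORLD_RANK"])
        world_size = int(env["OMPI_COMM_WORLD_SIZE"])
    else:
        # Single-process fallback when neither launcher is detected.
        local_rank, world_rank, world_size = 0, 0, 1
    return local_rank, world_rank, world_size
```

Checking SLURM first preserves the existing behavior when both sets of variables happen to be present.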
Thanks, Erik, that was it. With those lines added we managed to run a short training job with multiple nodes.
Greetings,
I am trying to run the distributed version of the PPO baselines training on a cluster that has SGE instead of SLURM, so the included script cannot be used.
I tried replacing `srun` with `mpirun`, but unsurprisingly it did not work ("Address already in use" error). It seems `ddp_utils.py` is tailored for SLURM; has there been any effort to do the same for MPI?

Thanks,
Alfredo