facebookresearch / habitat-lab

A modular high-level library to train embodied AI agents across a variety of tasks and environments.
https://aihabitat.org/
MIT License

Running ddppo baselines with MPI #457

Closed: asolano closed this issue 3 years ago

asolano commented 4 years ago

Greetings,

I am trying to run the distributed DD-PPO baseline training on a cluster that uses SGE instead of SLURM, so the included script cannot be used.

I tried replacing srun with mpirun, but unsurprisingly it did not work ("Address already in use"):

  File "/home/acb11899xv/habitat-api/habitat_baselines/rl/ddppo/algo/ddp_utils.py", line 160, in init_distrib_slurm
    trainer.train()
  File "/home/acb11899xv/habitat-api/habitat_baselines/rl/ddppo/algo/ddppo_trainer.py", line 138, in train
    main()
  File "/home/acb11899xv/habitat-api/habitat_baselines/run.py", line 39, in main
    self.config.RL.DDPPO.distrib_backend
  File "/home/acb11899xv/habitat-api/habitat_baselines/rl/ddppo/algo/ddp_utils.py", line 160, in init_distrib_slurm
    run_exp(**vars(args))
  File "/home/acb11899xv/habitat-api/habitat_baselines/run.py", line 64, in run_exp
    master_addr, master_port, world_size, world_rank == 0
RuntimeError: Address already in use
    master_addr, master_port, world_size, world_rank == 0
RuntimeError: Address already in use
    trainer.train()
  File "/home/acb11899xv/habitat-api/habitat_baselines/rl/ddppo/algo/ddppo_trainer.py", line 138, in train
2020-08-20 14:04:33,829 Initializing dataset PointNav-v1
    self.config.RL.DDPPO.distrib_backend
  File "/home/acb11899xv/habitat-api/habitat_baselines/rl/ddppo/algo/ddp_utils.py", line 160, in init_distrib_slurm
    master_addr, master_port, world_size, world_rank == 0
RuntimeError: Address already in use
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

It seems ddp_utils.py is tailored to SLURM: without the SLURM_* variables, every process apparently falls back to rank 0 and tries to bind the same master port, hence the error. Has there been any effort to do the same for MPI?

Thanks,

Alfredo

Skylion007 commented 4 years ago

While it will work automatically under SLURM, it will also work with MPI as long as the corresponding environment variables (LOCAL_RANK, etc.) are set per PyTorch conventions. Any distributed launch script that sets those variables properly for PyTorch should work. Open to PRs to get it to work automatically with MPI variables, though.
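
Concretely, PyTorch's env:// rendezvous reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment, and LOCAL_RANK is the usual torch.distributed.launch convention for the within-node rank. An untested sketch of what each process would need; the hostname, port, and rank values are placeholders for rank 0 of a hypothetical two-process job:

    import os

    os.environ["MASTER_ADDR"] = "node001"  # host where rank 0 runs
    os.environ["MASTER_PORT"] = "8738"     # any free port on that host
    os.environ["WORLD_SIZE"] = "2"         # total number of processes
    os.environ["RANK"] = "0"               # this process's global rank
    os.environ["LOCAL_RANK"] = "0"         # rank within this node, used to pick a GPU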

erikwijmans commented 4 years ago

I don't have access to a cluster where mpirun works correctly, so I can't check this, but the mapping for MPI environment variables should be pretty straightforward. Adding

    elif os.environ.get("OMPI_COMM_WORLD_SIZE", None) is not None:
        # Launched by Open MPI's mpirun: derive the ranks and world size
        # from the OMPI_* variables instead of SLURM's.
        local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
        world_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
        world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])

to ddp_utils.py should work.
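
To sanity-check that mapping without habitat in the loop, a tiny standalone script (hypothetical, untested on my end since I can't run mpirun here) launched as mpirun -np 2 python check_mpi_env.py should confirm the OMPI_* variables are present and that a process group can form:

    # check_mpi_env.py -- hypothetical standalone check, not part of the repo
    import os

    import torch.distributed as distrib

    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    world_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])

    # torch.distributed still needs a rendezvous point: localhost works on a
    # single node; across nodes MASTER_ADDR must be the rank-0 host.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "8738")
    os.environ["RANK"] = str(world_rank)
    os.environ["WORLD_SIZE"] = str(world_size)

    distrib.init_process_group(backend="gloo", init_method="env://")
    print(f"rank {world_rank}/{world_size} (local rank {local_rank}) initialized")
    distrib.destroy_process_group()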

asolano commented 4 years ago

Thanks, Erik, that was it. With those lines added, we managed to run a short training job across multiple nodes.