OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

"RuntimeError: Address already in use" while running multiple tasks #1298

Closed: memray closed this issue 5 years ago

memray commented 5 years ago

Hi,

I was trying to run two OpenNMT tasks on a single node at the same time. I launched both tasks with -world_size 2 -gpu_ranks 0 1, each with a different CUDA_VISIBLE_DEVICES. I reinstalled PyTorch from source (ver=1.1.0a0+c3f5ba9), but the problem remains. The error is:

  File "/home/mengr/project/kp/OpenNMT-kpg/train.py", line 59, in run
    gpu_rank = onmt.utils.distributed.multi_init(opt, device_id)
  File "/home/mengr/project/kp/OpenNMT-kpg/onmt/utils/distributed.py", line 27, in multi_init
    world_size=dist_world_size, rank=opt.gpu_ranks[device_id])
  File "/home/mengr/.conda/envs/kp_py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 354, in init_process_group
    store, rank, world_size = next(rendezvous(url))
  File "/home/mengr/.conda/envs/kp_py36/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 95, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Address already in use

Thanks, Rui

vince62s commented 5 years ago

This has been addressed before: you need to change the port on one of the tasks. Check here: https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/opts.py#L329
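For anyone wondering why the second task fails: the traceback ends in TCPStore, which listens on the master port, so two tasks on the same node that use the same port collide. Below is a minimal sketch of that collision using only the standard library, not the actual PyTorch code path. The helper names `bind_listener` and `second_bind_fails` are made up for illustration.

```python
import errno
import socket


def bind_listener(port, host="127.0.0.1"):
    """Bind and listen on (host, port), roughly what a rendezvous store does.

    Hypothetical helper for illustration; not part of OpenNMT-py or PyTorch.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def second_bind_fails(host="127.0.0.1"):
    """Return True if a second listener on an already-used port is refused."""
    first = bind_listener(0, host)  # port 0: let the OS pick a free port
    port = first.getsockname()[1]
    try:
        # Same port again, like a second training task with the same master port.
        second = bind_listener(port, host)
    except OSError as err:
        return err.errno == errno.EADDRINUSE
    else:
        second.close()
        return False
    finally:
        first.close()


if __name__ == "__main__":
    print("second bind refused:", second_bind_fails())
```

Giving each task its own port avoids the clash, which is what the option linked above controls.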

memray commented 5 years ago

Cool! It solved my problem. Thanks!

Rui

PhaelIshall commented 5 years ago

@vince62s I am having the same problem. How did you fix it? Not sure how to change the port on one of the tasks. Please let me know! @memray

memray commented 5 years ago

@vince62s Just as Vincent suggested, I set a different -master_port for each experiment (say, 10000 for your exp1 on GPUs 0,1 and 10001 for your exp2 on GPUs 2,3), and it works.
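A rough sketch of that setup, in case it helps. It only builds the environment and argument list for each task. The helpers `train_command` and `find_free_port` are hypothetical names, and the remaining training options (data, model paths and so on) are left out and would have to be passed via `extra_args`. The flags -world_size, -gpu_ranks and -master_port, and the script name train.py, are the ones mentioned in this thread.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused port (hypothetical helper).

    The port could in principle be taken again before training starts,
    so fixed, distinct ports such as 10000 and 10001 work just as well.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def train_command(gpus, master_port, extra_args=()):
    """Build (env, argv) for one training task on the given physical GPUs.

    Hypothetical helper: it only assembles the flags discussed in this
    thread and does not launch anything.
    """
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpus)}
    # Inside each task the visible GPUs are renumbered from 0.
    ranks = [str(i) for i in range(len(gpus))]
    argv = [
        "python", "train.py",
        "-world_size", str(len(gpus)),
        "-gpu_ranks", *ranks,
        "-master_port", str(master_port),
        *extra_args,
    ]
    return env, argv


if __name__ == "__main__":
    # exp1 on GPUs 0,1 and exp2 on GPUs 2,3, each with its own port.
    for gpus, port in (([0, 1], 10000), ([2, 3], 10001)):
        env, argv = train_command(gpus, port)
        print(env, " ".join(argv))
```

Each task would then be started with its own `env` merged into the process environment, for example via `subprocess.Popen(argv, env={**os.environ, **env})`.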