RuntimeError: Address already in use

Willy0919 commented 4 years ago

When I run a program on multiple GPUs, training works correctly. But when I start a second, similar program with only a few parameters changed, I get a RuntimeError. Even when I pass a new --dist-url, the printed output shows that the dist-url has not changed:

python tools/train_net.py \
--config-file configs/FCOS-Detection/R_50_1x.yaml \
--num-gpus 4 OUTPUT_DIR training_dir/fcos_R_50_1x_3d_ctr_real \
--dist-url tcp://127.0.0.1:50001

Command Line Args: Namespace(config_file='configs/FCOS-Detection/R_50_1x.yaml', dist_url='tcp://127.0.0.1:50152', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=['OUTPUT_DIR', 'training_dir/fcos_R_50_1x_3d_ctr_real', '--dist-url', 'tcp://127.0.0.1:50001'], resume=False)
Process group URL: tcp://127.0.0.1:50152
Traceback (most recent call last):
  File "/home/wl/code/AdelaiDet/tools/train_net.py", line 243, in <module>
    args=(args,),
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 54, in launch
    daemon=False,
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 72, in _distributed_worker
    raise e
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 67, in _distributed_worker
    backend="NCCL", init_method=dist_url, world_size=world_size, rank=global_rank
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 393, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

tianzhi0549 commented 4 years ago

@Willy0919 Please manually specify --dist-url (with a different port) in the training command line.
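If it is unclear which port is free, one option is to ask the operating system for an unused one and build the --dist-url value from it. The following is a minimal sketch using only the Python standard library; the helper name find_free_port is hypothetical and not part of AdelaiDet or detectron2. Note that another process could still claim the port between this check and the start of training, so this reduces collisions rather than ruling them out.

```python
import socket


def find_free_port():
    """Return a TCP port on localhost that was unused at the time of the call.

    Hypothetical helper for illustration; not part of AdelaiDet or detectron2.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS choose a currently unused port.
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


port = find_free_port()
# Value that could be passed to --dist-url on the training command line.
dist_url = f"tcp://127.0.0.1:{port}"
print(dist_url)
```

The printed URL can then be passed as the --dist-url argument for the second training run.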

Willy0919 commented 4 years ago

@tianzhi0549 I have done this as described:

python tools/train_net.py \
--config-file configs/FCOS-Detection/R_50_1x.yaml \
--num-gpus 4 OUTPUT_DIR training_dir/fcos_R_50_1x_3d_ctr_real \
--dist-url tcp://127.0.0.1:50001

but it did not work.

tianzhi0549 commented 4 years ago

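Please place --dist-url tcp://127.0.0.1:50001 before the trailing config options (OUTPUT_DIR ...). Everything after the first config option is collected into opts, which is why your printed Namespace shows '--dist-url' inside opts and dist_url still at its default. For example: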

python tools/train_net.py \
--config-file configs/FCOS-Detection/R_50_1x.yaml \
--dist-url tcp://127.0.0.1:50001 \
--num-gpus 4 OUTPUT_DIR training_dir/fcos_R_50_1x_3d_ctr_real
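The ordering effect can be reproduced with a small standalone parser. This is only a sketch: build_parser is a hypothetical, simplified stand-in, and it assumes the training script collects trailing config overrides with argparse.REMAINDER, which is consistent with the Namespace output printed in the original report. The default URL below is copied from that output purely for illustration.

```python
import argparse


def build_parser():
    """Hypothetical, simplified stand-in for the training script's parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file", default="")
    parser.add_argument("--num-gpus", type=int, default=1)
    # Default copied from the report's output, for illustration only.
    parser.add_argument("--dist-url", default="tcp://127.0.0.1:50152")
    # Assumption: trailing KEY VALUE overrides are gathered with REMAINDER,
    # so everything from the first positional onward lands in opts.
    parser.add_argument("opts", nargs=argparse.REMAINDER, default=None)
    return parser


parser = build_parser()

# --dist-url placed after OUTPUT_DIR: it is swallowed into opts.
late = parser.parse_args(
    ["--num-gpus", "4", "OUTPUT_DIR", "out", "--dist-url", "tcp://127.0.0.1:50001"]
)

# --dist-url placed before OUTPUT_DIR: it is parsed as a real flag.
early = parser.parse_args(
    ["--dist-url", "tcp://127.0.0.1:50001", "--num-gpus", "4", "OUTPUT_DIR", "out"]
)

print(late.dist_url, late.opts)
print(early.dist_url, early.opts)
```

In the first call the flag ends up inside opts and dist_url keeps its default; in the second call dist_url takes the new port and opts holds only the config override.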

Ziyan0829 commented 2 years ago

Hi, I have run into the same problem. Have you solved it?

aim-uofa / AdelaiDet

RuntimeError: Address already in use #149