Open Willy0919 opened 4 years ago
@Willy0919 Please manually specify --dist-url
(with a different port) in the training command line.
@tianzhi0549 I have done this as described:
python tools/train_net.py \
--config-file configs/FCOS-Detection/R_50_1x.yaml \
--num-gpus 4 OUTPUT_DIR training_dir/fcos_R_50_1x_3d_ctr_real \
--dist-url tcp://127.0.0.1:50001
but it did not work.
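Please place --dist-url tcp://127.0.0.1:50001
before the trailing config options (OUTPUT_DIR ...). Anything that comes after them is collected into the options list instead of being parsed as a flag. For example: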
python tools/train_net.py \
--config-file configs/FCOS-Detection/R_50_1x.yaml \
--dist-url tcp://127.0.0.1:50001 \
--num-gpus 4 OUTPUT_DIR training_dir/fcos_R_50_1x_3d_ctr_real
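Why the position matters can be reproduced with plain argparse. The sketch below is not the actual detectron2 parser; it is a minimal stand-in that assumes the trailing `opts` argument is declared with `nargs=argparse.REMAINDER`, which would explain why the Namespace printed later in this thread shows `--dist-url` inside `opts`. The default URL is copied from that log; the `build_parser` helper is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the training script's argument parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-file", default="")
    parser.add_argument("--num-gpus", type=int, default=1)
    # Default taken from the log in this thread; the real default may differ per user.
    parser.add_argument("--dist-url", default="tcp://127.0.0.1:50152")
    # Assumption: config overrides are collected with REMAINDER, so everything
    # from the first positional argument onward lands in `opts` unparsed.
    parser.add_argument("opts", nargs=argparse.REMAINDER, default=None)
    return parser


parser = build_parser()

# --dist-url placed AFTER the options: it is swallowed into `opts`.
late = parser.parse_args([
    "--config-file", "configs/FCOS-Detection/R_50_1x.yaml",
    "--num-gpus", "4",
    "OUTPUT_DIR", "training_dir/fcos_R_50_1x_3d_ctr_real",
    "--dist-url", "tcp://127.0.0.1:50001",
])

# --dist-url placed BEFORE the options: it is parsed as a flag.
early = parser.parse_args([
    "--config-file", "configs/FCOS-Detection/R_50_1x.yaml",
    "--dist-url", "tcp://127.0.0.1:50001",
    "--num-gpus", "4",
    "OUTPUT_DIR", "training_dir/fcos_R_50_1x_3d_ctr_real",
])

print(late.dist_url, late.opts)    # default URL, flag ends up inside opts
print(early.dist_url, early.opts)  # new URL, opts holds only the overrides
```

In the first call the default URL survives and the flag shows up inside `opts`, which matches the symptom reported in this thread.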
Hi, I have the same problem. Have you solved it?
When I run one program on multiple GPUs, it trains correctly. But when I start a second, similar program with only a few parameters changed, I get the RuntimeError below. Even though I passed a new dist-url, the printed output shows that the dist-url did not change:
Command Line Args: Namespace(config_file='configs/FCOS-Detection/R_50_1x.yaml', dist_url='tcp://127.0.0.1:50152', eval_only=False, machine_rank=0, num_gpus=4, num_machines=1, opts=['OUTPUT_DIR', 'training_dir/fcos_R_50_1x_3d_ctr_real', '--dist-url', 'tcp://127.0.0.1:50001'], resume=False)
Process group URL: tcp://127.0.0.1:50152
Traceback (most recent call last):
  File "/home/wl/code/AdelaiDet/tools/train_net.py", line 243, in
    args=(args,),
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 54, in launch
    daemon=False,
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 72, in _distributed_worker
    raise e
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/detectron2/engine/launch.py", line 67, in _distributed_worker
    backend="NCCL", init_method=dist_url, world_size=world_size, rank=global_rank
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 393, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/wl/miniconda3/envs/py3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
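"Address already in use" here means the second job tried to bind the same TCP port (50152) that the first job was already holding. One way to avoid guessing a port by hand is to ask the operating system for a free one and build the URL from it. The sketch below uses only the standard `socket` module; `find_free_port` and `make_dist_url` are hypothetical helper names, not part of AdelaiDet or detectron2, and the resulting string is meant to be passed as `--dist-url` before the trailing options.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS choose any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


def make_dist_url(host: str = "127.0.0.1") -> str:
    """Build a tcp:// URL of the form used for --dist-url in this thread."""
    return f"tcp://{host}:{find_free_port(host)}"


if __name__ == "__main__":
    # Print a URL that can be pasted into the training command line.
    print(make_dist_url())
```

The socket is closed before the training job starts, so another process could in principle grab the port in between. For two jobs launched by hand on one machine this is usually acceptable, but it is not a guarantee.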