NVIDIA / semantic-segmentation

Nvidia Semantic Segmentation monorepo
BSD 3-Clause "New" or "Revised" License

RuntimeError: Address already in use #180

Open noparkee opened 2 years ago

noparkee commented 2 years ago

I tried to run this model to evaluate dummy folders at the same time on a single GPU (an A100 with 80 GB of memory).

Evaluating a single folder works fine. However, when I start a second evaluation while the first one is still running, the error below appears. The problem seems to be in the torch.distributed package. From what I found by searching, changing the port number should solve it. Do you know how to change the port number in this code?
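For context, here is a rough sketch of one way the port could be changed, assuming the job is started through `python -m torch.distributed.launch` (which the traceback suggests) and not through some wrapper that builds the command itself. `--master_port` and `--nproc_per_node` are existing options of that launcher; the helper names `find_free_port` and `build_launch_command` are hypothetical and only for illustration, and the script arguments are shortened.

```python
import socket
import sys


def find_free_port() -> int:
    """Ask the OS for a TCP port that is currently unused (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS choose any free port
        return s.getsockname()[1]


def build_launch_command(port: int, script_args: list) -> list:
    """Build a torch.distributed.launch command line that uses the given port.

    Assumption: train.py is launched directly with torch.distributed.launch
    on a single GPU, so --nproc_per_node is 1.
    """
    return [
        sys.executable, "-m", "torch.distributed.launch",
        "--nproc_per_node=1",
        f"--master_port={port}",  # each concurrent run needs its own port
        "train.py",
        *script_args,
    ]


if __name__ == "__main__":
    port = find_free_port()
    cmd = build_launch_command(port, ["--eval", "folder"])
    print(" ".join(cmd))
```

Note that the port is released before the launcher binds it, so another process could in principle grab it in between. Passing a fixed, distinct `--master_port` value to each concurrent run would avoid that.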

error message

None
Global Rank: 0 Local Rank: 0
Killing subprocess 659577
Traceback (most recent call last):
  File "train.py", line 299, in <module>
    torch.distributed.init_process_group(backend='nccl',
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=0', '--dataset', 'cityscapes', '--cv', '0', '--syncbn', '--apex', '--fp16', '--bs_val', '1', '--eval', 'folder', '--eval_folder', '/workspace/lyft_trainval_images', '--dump_assets', '--dump_all_images', '--n_scales', '0.5,1.0,2.0', '--snapshot', 'large_asset_dir/seg_weights/cityscapes_ocrnet.HRNet_Mscale_outstanding-turtle.pth', '--arch', 'ocrnet.HRNet_Mscale', '--result_dir', 'logs/dump_folder/frisky-serval_2022.06.17_17.15']' returned non-zero exit status 1.
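
The traceback fails at `TCPStore(master_addr, master_port, ...)`, which is consistent with the second run trying to listen on a port that the first run already holds. A minimal sketch that reproduces the same class of error with plain sockets, without PyTorch, might look like this. The function name `second_bind_errno` is hypothetical, and this only illustrates the operating-system behaviour, not the library's internals.

```python
import errno
import socket


def second_bind_errno(host: str = "127.0.0.1"):
    """Bind a listening socket, then try to bind a second one to the same port.

    Returns the errno raised by the second bind, or None if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # the OS picks a free port for the first socket
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same address and port as the first
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print("address in use:", code == errno.EADDRINUSE)
```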