I tried to run this model to evaluate two dump folders at the same time on a single GPU (an A100 with 80 GB of memory).
Evaluating one folder works fine, but as soon as I start a second run alongside it, an error appears.
It seems to be a problem with the torch.distributed package.
When I googled it, people said that changing the port number would solve the problem.
Do you know how to change the port number in this code?
error message
None
Global Rank: 0 Local Rank: 0
Killing subprocess 659577
Traceback (most recent call last):
File "train.py", line 299, in <module>
torch.distributed.init_process_group(backend='nccl',
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=0', '--dataset', 'cityscapes', '--cv', '0', '--syncbn', '--apex', '--fp16', '--bs_val', '1', '--eval', 'folder', '--eval_folder', '/workspace/lyft_trainval_images', '--dump_assets', '--dump_all_images', '--n_scales', '0.5,1.0,2.0', '--snapshot', 'large_asset_dir/seg_weights/cityscapes_ocrnet.HRNet_Mscale_outstanding-turtle.pth', '--arch', 'ocrnet.HRNet_Mscale', '--result_dir', 'logs/dump_folder/frisky-serval_2022.06.17_17.15']' returned non-zero exit status 1.
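
From the traceback it looks like train.py calls torch.distributed.init_process_group() with the default env:// rendezvous, so my guess is that the port could be changed either by setting the MASTER_PORT environment variable for the second run, or by passing --master_port to torch.distributed.launch. Below is only a rough sketch of the first idea, not code from this repo (the names train.py actually uses may differ, and 29501 is just an example port):

# Sketch only: choose a free rendezvous port before init_process_group() is called.
# Assumes the default env:// rendezvous, which the traceback suggests train.py uses.
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ["MASTER_PORT"] = "29501"  # the second job needs a port that is not already in use

dist.init_process_group(backend="nccl", init_method="env://", rank=0, world_size=1)

If --master_port on torch.distributed.launch is the intended way, I could simply pass a different port to each run instead. Is there a proper flag or config option in this repo for that?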