NVIDIA / semantic-segmentation

Nvidia Semantic Segmentation monorepo
BSD 3-Clause "New" or "Revised" License

RuntimeError: No rendezvous handler for env:// on Windows #120

Open divastar opened 3 years ago

divastar commented 3 years ago

Hi. I am on Windows 10. How can I solve the "RuntimeError: No rendezvous handler for env://" problem?

Traceback (most recent call last):
  File "train.py", line 299, in <module>
    torch.distributed.init_process_group(backend='nccl',
  File "C:\Users\korin\anaconda3\envs\myenv\lib\site-packages\torch\distributed\distributed_c10d.py", line 433, in init_process_group
    rendezvous_iterator = rendezvous(
  File "C:\Users\korin\anaconda3\envs\myenv\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
    raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for env://
Traceback (most recent call last):
  File "train.py", line 298, in <module>
    torch.cuda.set_device(args.local_rank)
  File "C:\Users\korin\anaconda3\envs\myenv\lib\site-packages\torch\cuda\__init__.py", line 263, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Traceback (most recent call last):
  File "C:\Users\korin\anaconda3\envs\myenv\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\korin\anaconda3\envs\myenv\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\korin\anaconda3\envs\myenv\lib\site-packages\torch\distributed\launch.py", line 260, in <module>
    main()
  File "C:\Users\korin\anaconda3\envs\myenv\lib\site-packages\torch\distributed\launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['C:\Users\korin\anaconda3\envs\myenv\python.exe', '-u', 'train.py', '--local_rank=1', '--dataset', 'cityscapes', '--cv', '0', '--syncbn', '--apex', '--fp16', '--bs_val', '1', '--eval', 'folder', '--eval_folder', './imgs/test_imgs', '--dump_assets', '--dump_all_images', '--n_scales', '0.5,1.0,2.0', '--snapshot', 'ASSETS_PATH/seg_weights/cityscapes_ocrnet.HRNet_Mscale_outstanding-turtle.pth', '--arch', 'ocrnet.HRNet_Mscale', '--result_dir', 'logs\dump_folder\singing-earwig_2021.02.21_07.56']' returned non-zero exit status 1.

ycwang-libra commented 3 years ago

I have the same question.

Lqqqying commented 3 years ago

I have the same problem on Windows 10.

ljz756245026 commented 3 years ago

python -m torch.distributed.launch xxx.py

Use the above command to run the .py file in a cmd window.
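For context, the env:// init method reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment, and the launcher sets these for each worker it starts. A minimal diagnostic sketch (the helper name `missing_rendezvous_vars` is made up for illustration, it is not part of this repo or of PyTorch) that reports which of them are absent, to help rule out the script having been started without the launcher:

```python
import os

# Variables the env:// init method reads; the launcher normally sets them.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_rendezvous_vars(environ=None):
    """Return the env:// variables that are absent or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_rendezvous_vars()
    if missing:
        print("Not started by the launcher? Missing:", ", ".join(missing))
    else:
        print("All env:// variables are set.")
```

If all four are set and the error still appears, the cause is more likely the PyTorch build itself than the way the script was started.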

herene commented 3 years ago

The Windows system does not support DDP; just comment out "accelerator="ddp"" in train.py.
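Instead of removing the distributed setup outright, the call could be guarded. The following is only a rough sketch, not the repo's actual code: `pick_backend` and `maybe_init_distributed` are hypothetical names, and it assumes train.py can run without a process group when one is not available. NCCL is not available on Windows, so the sketch falls back to gloo there, which only works on PyTorch builds that ship distributed support for Windows. It also checks `local_rank` against the number of visible GPUs, since the "invalid device ordinal" error in the traceback is what `torch.cuda.set_device` raises when the index is out of range.

```python
import sys


def pick_backend(platform=None, cuda_available=True):
    """Hypothetical helper: choose a torch.distributed backend name."""
    platform = sys.platform if platform is None else platform
    if platform.startswith("win"):
        # NCCL is not available on Windows; gloo is the usual fallback.
        return "gloo"
    return "nccl" if cuda_available else "gloo"


def maybe_init_distributed(local_rank=0):
    """Initialise the process group only if this PyTorch build supports it.

    Returns True when a process group was created, False otherwise.
    """
    # Imported lazily so the backend helper above works without PyTorch.
    import torch
    import torch.distributed as dist

    if not dist.is_available():
        # Build without distributed support: run as a single process.
        return False

    use_cuda = torch.cuda.is_available()
    if use_cuda:
        if local_rank >= torch.cuda.device_count():
            raise ValueError(
                f"local_rank {local_rank} but only "
                f"{torch.cuda.device_count()} GPU(s) are visible"
            )
        torch.cuda.set_device(local_rank)

    dist.init_process_group(
        backend=pick_backend(cuda_available=use_cuda),
        init_method="env://",
    )
    return True
```

On a single-GPU machine the launcher would also need to start only one process, otherwise the second worker still gets `--local_rank=1` and fails the device check.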

herene commented 3 years ago

Also, I found that an incomplete configuration file, or an input_size that does not match, can lead to this error.