Open dhruvampanchal opened 3 years ago
Are you planning on training the model on only 1 GPU? The error you are seeing is raised by one of the PyTorch functions used for the multi-GPU setup. In our setup, we could run with --num_process_per_node 1 without any issue, probably because our system supported init_method="env://".
I have only one RTX 2080 Super MaxQ GPU right now, and I think it does not support init_method="env://".
I have the same problem. How can I check whether my GPU supports init_method="env://"?
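As far as I know, whether init_method="env://" works depends on the installed PyTorch build and the operating system, not on the GPU model. Below is a minimal sketch for inspecting what the local build reports. It relies on torch.distributed.is_available, is_nccl_available and is_gloo_available; the function name distributed_report is made up for illustration. It does not test the env:// handler directly, so treat the output as a hint only.

```python
import sys


def distributed_report() -> dict:
    """Collect what the installed PyTorch build reports about torch.distributed.

    Hypothetical helper for illustration; it is not part of NVAE or PyTorch.
    """
    report = {"platform": sys.platform, "torch_installed": False}
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed in this environment.
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    report["distributed_available"] = dist.is_available()

    if report["distributed_available"]:
        # getattr guards against builds that lack one of these helpers.
        report["nccl_available"] = getattr(dist, "is_nccl_available", lambda: False)()
        report["gloo_available"] = getattr(dist, "is_gloo_available", lambda: False)()
    return report


if __name__ == "__main__":
    for key, value in distributed_report().items():
        print(f"{key}: {value}")
```

If distributed_available is False, or nccl_available is False on your machine, that would be consistent with the backend='nccl' call in train.py failing regardless of which GPU is installed.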
As per the README file, I have changed some argument values to make the model easier to train. Since I have only one GPU, I have changed num_process_per_node to 1.
But I am getting this error.
python train.py --data D:\Project\TAMU\Hierarchical_Variational_Autoencoders\NVAE#start-of-content\mnist --root D:\Project\TAMU\Hierarchical_Variational_Autoencoders\NVAE#start-of-content\CHECKPOINT\ --save 01 --dataset mnist --batch_size 200 --epochs 100 --num_latent_scales 2 --num_groups_per_scale 10 --num_postprocess_cells 3 --num_preprocess_cells 3 --num_cell_per_cond_enc 1 --num_cell_per_cond_dec 1 --num_latent_per_group 20 --num_preprocess_blocks 2 --num_postprocess_blocks 2 --weight_decay_norm 1e-2 --num_channels_enc 16 --num_channels_dec 16 --num_nf 0 --ada_groups --num_process_per_node 1 --use_se --res_dist --fast_adamax
Experiment dir : D:\Project\TAMU\Hierarchical_Variational_Autoencoders\NVAE#start-of-content\CHECKPOINT\/eval-01
starting in debug mode
Traceback (most recent call last):
  File "train.py", line 415, in <module>
    init_processes(0, size, main, args)
  File "train.py", line 280, in init_processes
    dist.init_process_group(backend='nccl', init_method='env://', rank=rank, world_size=size)
  File "D:\ProgramData\Anaconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 421, in init_process_group
    init_method, rank, world_size, timeout=timeout
  File "D:\ProgramData\Anaconda3\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
    raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for env://
I am also new to PyTorch; I have used TensorFlow until now.
Can you please tell me what type of error this is and how I can solve it?
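For context on the init_process_group call in the traceback: init_method="env://" reads the rendezvous address from the MASTER_ADDR and MASTER_PORT environment variables, and to my knowledge the NCCL backend is not available on Windows, where Gloo is the usual alternative. The following is only a sketch of a single-process setup under those assumptions, not a confirmed fix. The helper names, the address 127.0.0.1 and the port 29500 are placeholders chosen for illustration, and on older PyTorch versions for Windows the env:// handler may still be missing.

```python
import os
import sys


def choose_backend(platform: str = sys.platform) -> str:
    """Pick 'gloo' on Windows (assumed to lack NCCL), otherwise 'nccl'."""
    return "gloo" if platform.startswith("win") else "nccl"


def set_rendezvous_env(addr: str = "127.0.0.1", port: int = 29500) -> None:
    """Set the variables that init_method='env://' reads, unless already set.

    The default address and port are placeholders, not values taken from NVAE.
    """
    os.environ.setdefault("MASTER_ADDR", addr)
    os.environ.setdefault("MASTER_PORT", str(port))


def init_single_process(rank: int = 0, world_size: int = 1) -> str:
    """Initialise a one-process group and return the backend that was used."""
    # Imported lazily so the helpers above work even without PyTorch installed.
    import torch.distributed as dist

    backend = choose_backend()
    set_rendezvous_env()
    dist.init_process_group(
        backend=backend,
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )
    return backend
```

Calling init_single_process() in place of the hard-coded backend='nccl' call would be one way to try this. If it still raises "No rendezvous handler for env://", the installed PyTorch version probably does not register that handler on this platform, and upgrading PyTorch or running on Linux would be the next thing to check.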