NVlabs / NVAE

The Official PyTorch Implementation of "NVAE: A Deep Hierarchical Variational Autoencoder" (NeurIPS 2020 spotlight paper)
https://arxiv.org/abs/2007.03898

No rendezvous handler for env:// #12

Open dhruvampanchal opened 3 years ago

dhruvampanchal commented 3 years ago

As per the readme file, I have changed some argument values to make the model easier to train. Since I have only one GPU, I changed num_process_per_node to 1.

But I am getting this error:

python train.py --data D:\Project\TAMU\Hierarchical_Variational_Autoencoders\NVAE\mnist --root D:\Project\TAMU\Hierarchical_Variational_Autoencoders\NVAE\CHECKPOINT\ --save 01 --dataset mnist --batch_size 200 --epochs 100 --num_latent_scales 2 --num_groups_per_scale 10 --num_postprocess_cells 3 --num_preprocess_cells 3 --num_cell_per_cond_enc 1 --num_cell_per_cond_dec 1 --num_latent_per_group 20 --num_preprocess_blocks 2 --num_postprocess_blocks 2 --weight_decay_norm 1e-2 --num_channels_enc 16 --num_channels_dec 16 --num_nf 0 --ada_groups --num_process_per_node 1 --use_se --res_dist --fast_adamax

Experiment dir : D:\Project\TAMU\Hierarchical_Variational_Autoencoders\NVAE\CHECKPOINT\/eval-01
starting in debug mode
Traceback (most recent call last):
  File "train.py", line 415, in <module>
    init_processes(0, size, main, args)
  File "train.py", line 280, in init_processes
    dist.init_process_group(backend='nccl', init_method='env://', rank=rank, world_size=size)
  File "D:\ProgramData\Anaconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 421, in init_process_group
    init_method, rank, world_size, timeout=timeout
  File "D:\ProgramData\Anaconda3\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
    raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for env://

I am new to PyTorch too; I have used TensorFlow until now.

Can you please tell me what kind of error this is and how I can solve it?

arash-vahdat commented 3 years ago

Are you planning to train the model on only 1 GPU? The error you are seeing is raised by one of the PyTorch functions used for the multi-GPU setup. In our setup, we could run with --num_process_per_node 1 without any issue, probably because our system supported init_method="env://".
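For a single-GPU run, one possible workaround is to replace the env:// rendezvous with a file:// one and the nccl backend with gloo, both of which tend to be available on builds where env:// is not. This is only a sketch, not part of the official train.py: the function name init_single_gpu and the rendezvous file name are mine.

```python
import os
import tempfile

import torch.distributed as dist


def init_single_gpu(fn, args):
    """Hypothetical stand-in for train.py's init_processes for a
    1-process run: uses the gloo backend and a file:// rendezvous,
    avoiding the env:// handler this PyTorch build lacks."""
    init_file = os.path.join(tempfile.gettempdir(), "nvae_dist_init")
    if os.path.exists(init_file):
        os.remove(init_file)  # a stale rendezvous file can make init hang
    dist.init_process_group(
        backend="gloo",                     # CPU-side backend, widely available
        init_method="file://" + init_file,  # file-based rendezvous
        rank=0,
        world_size=1,
    )
    try:
        fn(args)
    finally:
        dist.destroy_process_group()
```

Note that gloo runs its collectives on CPU, but with world_size=1 there is effectively no communication, so this mainly satisfies train.py's assumption that a process group exists.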

dhruvampanchal commented 3 years ago

I have only 1 RTX 2080 Super MaxQ GPU right now. And I think it does not support init_method="env://".

whuLames commented 1 year ago

I have a question: how can I check whether my GPU supports init_method="env://"?
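It depends less on the GPU than on the PyTorch build. A small probe (a sketch; the function name is mine) can report whether this installation has distributed support at all and which backends it offers; roughly, if distributed support or the nccl backend is missing, the nccl + env:// combination used by train.py will not work:

```python
import torch
import torch.distributed as dist


def distributed_support_report():
    """Report distributed-training support in this PyTorch build."""
    available = dist.is_available()
    return {
        "torch_version": torch.__version__,
        "distributed_available": available,
        # NCCL is Linux-only; Windows builds report False here.
        "nccl_available": available and dist.is_nccl_available(),
        "gloo_available": available and dist.is_gloo_available(),
    }


print(distributed_support_report())
```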