NVlabs / NVAE

The Official PyTorch Implementation of "NVAE: A Deep Hierarchical Variational Autoencoder" (NeurIPS 2020 spotlight paper)
https://arxiv.org/abs/2007.03898

RuntimeError: NCCL error #9

Open eyalbetzalel opened 3 years ago

eyalbetzalel commented 3 years ago

Hi,

I am trying to run NVAE on my machine with your command line for CIFAR10 (changing only the .. from 8 to 4, because I have 4 GPUs):

export EXPR_ID=/home/dsi/eyalbetzalel/NVAE/logs  
export DATA_DIR=/home/dsi/eyalbetzalel/NVAE/data 
export CHECKPOINT_DIR=/home/dsi/eyalbetzalel/NVAE/cpt  
export CODE_DIR=/home/dsi/eyalbetzalel/NVAE  
cd $CODE_DIR

nohup python train.py --data $DATA_DIR/cifar10 --root $CHECKPOINT_DIR --save $EXPR_ID --dataset cifar10 \
        --num_channels_enc 128 --num_channels_dec 128 --epochs 400 --num_postprocess_cells 2 --num_preprocess_cells 2 \
        --num_latent_scales 1 --num_latent_per_group 20 --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 \
        --num_preprocess_blocks 1 --num_postprocess_blocks 1 --num_groups_per_scale 30 --batch_size 32 \
        --weight_decay_norm 1e-2 --num_nf 1 --num_process_per_node 4 --use_se --res_dist --fast_adamax &> NVAE_DSIGPU13_test_2_22102020.out &

and get this error:

    File "/home/dsi/eyalbetzalel/miniconda3/envs/NVAE_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
    File "/home/dsi/eyalbetzalel/miniconda3/envs/NVAE_env/lib/python3.8/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
    File "train.py", line 281, in init_processes
        fn(args)
    File "train.py", line 92, in main
        train_nelbo, global_step = train(train_queue, model, cnn_optimizer, grad_scalar, global_step, warmup_iters, writer, logging)
    File "train.py", line 160, in train
        utils.average_params(model.parameters(), args.distributed)
    File "/home/dsi/eyalbetzalel/NVAE/utils.py", line 274, in average_params
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
    File "/home/dsi/eyalbetzalel/miniconda3/envs/NVAE_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 936, in all_reduce
        work = _default_pg.allreduce([tensor], opts)
    RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled system error, NCCL version 2.4.8

Am I doing something wrong?

Thanks, Eyal


kaushik333 commented 3 years ago

Perhaps there is a version mismatch between your PyTorch, CUDA, and NCCL versions? What versions are you using?
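One quick way to check the versions in play is from inside the environment itself. A minimal sketch (the `report_versions` helper is hypothetical; `torch.__version__`, `torch.version.cuda`, and `torch.cuda.nccl.version()` are standard PyTorch attributes, and the function degrades gracefully if torch is missing or CUDA is unavailable):

```python
import importlib.util

def report_versions():
    """Collect PyTorch / CUDA / NCCL version strings, where available."""
    info = {}
    if importlib.util.find_spec("torch") is None:
        return info  # torch not installed in this environment
    import torch
    info["torch"] = torch.__version__
    # CUDA toolkit version this torch build was compiled against (None for CPU-only builds)
    info["cuda"] = str(torch.version.cuda)
    if torch.cuda.is_available():
        # NCCL version bundled with torch; an int on older builds, a tuple on newer ones
        info["nccl"] = str(torch.cuda.nccl.version())
    return info

if __name__ == "__main__":
    for name, ver in report_versions().items():
        print(f"{name}: {ver}")
```

Comparing the reported NCCL version against the one in the error message (2.4.8 here) is a reasonable first step.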

ImanHosseini commented 3 years ago

Are you running under WSL? WSL does not yet support NCCL: https://github.com/NVIDIA/nccl/issues/442 If you are on WSL, you can try changing the backend in train.py:280, "dist.init_process_group(backend='nccl', init_method='env://', rank=rank, world_size=size)", from "nccl" to "gloo".
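For reference, a minimal standalone sketch of that substitution (the MASTER_ADDR/MASTER_PORT defaults and the single-process world size here are illustrative, not part of NVAE's launcher, which sets up its own rendezvous):

```python
import os
import torch
import torch.distributed as dist

def init_processes_gloo(rank=0, size=1):
    # Same call as train.py:280, with the backend switched from 'nccl' to 'gloo'.
    # Illustrative rendezvous defaults for a single-machine run:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", init_method="env://",
                            rank=rank, world_size=size)

if __name__ == "__main__":
    init_processes_gloo()
    # gloo supports all_reduce on CPU tensors, so the parameter averaging in
    # utils.average_params still works without NCCL (at lower throughput)
    t = torch.ones(3)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(t.tolist())  # world size 1, so the sum leaves values unchanged
    dist.destroy_process_group()
```

Note that gloo performs collectives on CPU, so multi-GPU training will be noticeably slower than with NCCL.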