NVlabs / NVAE

The Official PyTorch Implementation of "NVAE: A Deep Hierarchical Variational Autoencoder" (NeurIPS 2020 spotlight paper)
https://arxiv.org/abs/2007.03898

Problems occurred when training on CelebA64 data #1

Closed Lukelluke closed 3 years ago

Lukelluke commented 3 years ago

Update: `--num_process_per_node 8` is the number of GPUs (one training process is launched per GPU), so on my machine it needs to be 1. The corrected command is shown below.
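For anyone who finds this later, the fix is just the command from the log below with `--num_process_per_node 1` and everything else unchanged:

```bash
# Same settings as in the traceback below, but with a single training process,
# since only one GPU is visible on my machine.
python train.py --data ./scripts/data1/datasets/celeba_org/celeba64_lmdb \
    --root ./CHECKPOINT_DIR --save ./EXPR_ID --dataset celeba_64 \
    --num_channels_enc 32 --num_channels_dec 32 --epochs 90 \
    --num_postprocess_cells 2 --num_preprocess_cells 2 \
    --num_latent_scales 3 --num_latent_per_group 20 \
    --num_cell_per_cond_enc 1 --num_cell_per_cond_dec 1 \
    --num_preprocess_blocks 1 --num_postprocess_blocks 1 \
    --weight_decay_norm 1e-1 --num_groups_per_scale 5 \
    --batch_size 1 --num_nf 1 --ada_groups \
    --num_process_per_node 1 --use_se --res_dist --fast_adamax
```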

So this turned out to be the first silly question asked under a great project. Leaving it here for a laugh, and I hope everyone else succeeds too.


Hi, dear Arash Vahdat,

NVAE is great work! We are excited to see this official implementation.

After hesitating for two days, I still can't help but ask: has anyone else successfully reproduced this implementation on a private machine?

During my run there are still some errors, and I'm not sure whether they are caused purely by my GPU.

Here is the traceback message:

```
(hsj-torch-gpu16) hsj@hsj:/data/hsj/NVAE$ python train.py --data ./scripts/data1/datasets/celeba_org/celeba64_lmdb --root ./CHECKPOINT_DIR --save ./EXPR_ID --dataset celeba_64 --num_channels_enc 32 --num_channels_dec 32 --epochs 90 --num_postprocess_cells 2 --num_preprocess_cells 2 --num_latent_scales 3 --num_latent_per_group 20 --num_cell_per_cond_enc 1 --num_cell_per_cond_dec 1 --num_preprocess_blocks 1 --num_postprocess_blocks 1 --weight_decay_norm 1e-1 --num_groups_per_scale 5 --batch_size 1 --num_nf 1 --ada_groups --num_process_per_node 8 --use_se --res_dist --fast_adamax
Experiment dir : ./CHECKPOINT_DIR/eval-./EXPR_ID
Node rank 0, local proc 0, global proc 0
Node rank 0, local proc 1, global proc 1
Node rank 0, local proc 2, global proc 2
Node rank 0, local proc 3, global proc 3
Node rank 0, local proc 4, global proc 4
Node rank 0, local proc 5, global proc 5
Node rank 0, local proc 6, global proc 6
Node rank 0, local proc 7, global proc 7
Process Process-1:
Traceback (most recent call last):
  File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "train.py", line 280, in init_processes
    dist.init_process_group(backend='nccl', init_method='env://', rank=rank, world_size=size)
  File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
Process Process-3:
Traceback (most recent call last):
  File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "train.py", line 279, in init_processes
    torch.cuda.set_device(args.local_rank)
  File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/site-packages/torch/cuda/__init__.py", line 281, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59

[the same THCudaCheck FAIL line and "invalid device ordinal" traceback are repeated for Process-4, Process-5, Process-6, Process-7 and Process-8]

Process Process-2:
Traceback (most recent call last):
  File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "train.py", line 281, in init_processes
    fn(args)
  File "train.py", line 42, in main
    model = AutoEncoder(args, writer, arch_instance)
  File "/data/hsj/NVAE/model.py", line 163, in __init__
    self.init_normal_sampler(mult)
  File "/data/hsj/NVAE/model.py", line 270, in init_normal_sampler
    nf_cells.append(PairedCellAR(self.num_latent_per_group, num_c1, num_c2, arch))
  File "/data/hsj/NVAE/model.py", line 93, in __init__
    self.cell1 = CellAR(num_z, num_ftr, num_c, arch, mirror=False)
  File "/data/hsj/NVAE/model.py", line 66, in __init__
    self.conv = ARInvertedResidual(num_z, num_ftr, ex=ex, mirror=mirror)
  File "/data/hsj/NVAE/neural_ar_operations.py", line 147, in __init__
    layers.extend([ARConv2d(inz, hidden_dim, kernel_size=3, padding=1, masked=True, mirror=mirror, zero_diag=True),
  File "/data/hsj/NVAE/neural_ar_operations.py", line 87, in __init__
    self.mask = torch.from_numpy(create_conv_mask(kernel_size, C_in, groups, C_out, zero_diag, mirror)).cuda()
RuntimeError: CUDA error: out of memory

(hsj-torch-gpu16) bjfu@bjfu-15043:/data/hsj/NVAE$ lspci -vnn | grep -A6 "VGA"

[a second, partial copy of the same out-of-memory traceback from another worker was interleaved with the shell prompt above and is omitted here]
```

P.S. Due to the Google Drive problem, I downloaded the data separately, added it to the /data1 folder, and then converted it into LMDB format.
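In case it matters, this is roughly the kind of conversion I did; the paths and the integer-key layout are my own illustrative guesses, not necessarily what the repo's data loader expects:

```python
import os
import lmdb

# Rough sketch of packing the separately downloaded CelebA 64x64 images into an LMDB.
# src_dir, out_path and the integer-key scheme are illustrative choices only.
src_dir = './data1/celeba64_images'
out_path = './scripts/data1/datasets/celeba_org/celeba64_lmdb'

env = lmdb.open(out_path, map_size=20 * 1024 ** 3)  # generous 20 GB map size
with env.begin(write=True) as txn:
    for i, fname in enumerate(sorted(os.listdir(src_dir))):
        with open(os.path.join(src_dir, fname), 'rb') as f:
            txn.put(str(i).encode(), f.read())  # store raw image bytes under an integer key
env.close()
```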

My first instinct was that GPU memory is not enough, so I reduced the batch size and the model parameters as much as possible, as you can see in the command line above; however, it still doesn't work.

P.S. Device information:

```
NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2
2 x GTX 1080 Ti (the second one is running another job, so only device 0 is available)
torch==1.6.0
torchvision==0.7.0
```
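For what it's worth, this is the quick sanity check I run (plain `torch.cuda` calls) to confirm what the training job will actually see; since the second card is busy, I also set `CUDA_VISIBLE_DEVICES=0` so that only the free 1080 Ti is exposed:

```python
import torch

# Sanity check before launching training on a shared machine.
# Run with e.g. `CUDA_VISIBLE_DEVICES=0 python check_gpu.py` (check_gpu.py is
# just this snippet, the name is arbitrary) so only the free card is exposed.
print('visible GPUs:', torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, round(props.total_memory / 1024 ** 3, 1), 'GiB')  # a 1080 Ti reports ~11 GiB
```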

NVAE is a great breakthrough, and I hope we can all reproduce it and draw more inspiration from it.

Looking forward to any useful suggestions,

Sincerely

Luke Huang