NVlabs / LSGM

The Official PyTorch Implementation of "LSGM: Score-based Generative Modeling in Latent Space" (NeurIPS 2021)

can't run 'train_vada.py' for best FID on cifar10 #8

Closed wangherr closed 1 year ago

wangherr commented 2 years ago

Command:

# cifar10
# - LSGM (best FID):
mpirun --allow-run-as-root -np 2 -npernode 1 bash -c 
    'python train_vada.py --fid_dir $FID_STATS_DIR --data $DATA_DIR/cifar10 --root $CHECKPOINT_DIR \
    --save $EXPR_ID/lsgm2 --vae_checkpoint $EXPR_ID/vae2/checkpoint.pt --train_vae --custom_conv_dae --apply_sqrt2_res \
    --fir --cont_kl_anneal --dae_arch ncsnpp --embedding_scale 1000 --dataset cifar10 --learning_rate_dae 1e-4 \
    --learning_rate_min_dae 1e-4 --epochs 1875 --dropout 0.2 --batch_size 16 --num_channels_dae 512 --num_scales_dae 3 \
    --num_cell_per_scale_dae 8 --sde_type vpsde --beta_start 0.1 --beta_end 20.0 --sigma2_0 0.0 \
    --weight_decay_norm_dae 1e-2 --weight_decay_norm_vae 1e-2 --time_eps 0.01 --train_ode_eps 1e-6 --eval_ode_eps 1e-6 \
    --train_ode_solver_tol 1e-5 --eval_ode_solver_tol 1e-5 --iw_sample_p drop_all_iw --iw_sample_q reweight_p_samples \
    --arch_instance_dae res_ho_attn --num_process_per_node 8 --use_se --node_rank $NODE_RANK --num_proc_node 2 \
    --master_address $IP_ADDR '

Error:

 File "/**/miniconda3/envs/lsgm/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/**/miniconda3/envs/lsgm/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/**/lsgm/util/utils.py", line 690, in init_processes
    fn(args)
  File "train_vada.py", line 178, in main
    train_obj, global_step = train_vada_joint(train_queue, diffusion_cont, dae, dae_optimizer, vae, vae_optimizer,
  File "/**/lsgm/training_obj_joint.py", line 135, in train_vada_joint
    grad_scalar.scale(p_loss).backward()
  File "/**/miniconda3/envs/lsgm/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/**/miniconda3/envs/lsgm/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [18, 256, 3, 3]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
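The hint at the end of the traceback is the quickest way to localize this: with `torch.autograd.set_detect_anomaly(True)`, the RuntimeError additionally prints the forward-pass traceback of the operation whose saved tensor was modified in place. A self-contained toy demonstration of the mechanism (unrelated to LSGM's code):

```python
import torch

torch.autograd.set_detect_anomaly(True)  # or use it as a context manager

w = torch.randn(3, 3, requires_grad=True)
y = w.exp()          # ExpBackward saves its output for the backward pass
y.add_(1.0)          # in-place edit bumps y's version counter
y.sum().backward()   # RuntimeError now also points at the exp() forward call
```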

my env:

python                    3.8.13 
pytorch                   1.8.0           py3.8_cuda11.1_cudnn8.0.5_0
torchvision               0.9.0                py38_cu111
WelkinYang commented 1 year ago

Have you solved this problem yet? I think the reason for this error is that the VAE's optimizer performs a "step" before the SGM loss is backpropagated, which modifies weights in place that the SGM loss's backward pass still needs. The solution is to detach the gradient of the eps, or to move the VAE optimizer's "step" to after the backpropagation of the SGM loss. I am not sure how the authors ran the code; it may be related to the PyTorch version.
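For reference, a minimal sketch of the failure mode described above and the reordering fix. The two toy Linear modules stand in for the VAE and the SGM prior; names and shapes are illustrative, not LSGM's actual training loop:

```python
import torch
import torch.nn as nn

vae, sgm = nn.Linear(4, 4), nn.Linear(4, 1)   # stand-ins for the real models
vae_opt = torch.optim.SGD(vae.parameters(), lr=1e-3)
sgm_opt = torch.optim.SGD(sgm.parameters(), lr=1e-3)

x = torch.randn(8, 4, requires_grad=True)
eps = vae(x)                       # latent used by both losses
vae_loss = eps.pow(2).mean()
p_loss = sgm(eps).mean()           # SGM loss; its graph runs back through vae

vae_loss.backward(retain_graph=True)
# vae_opt.step()                   # BROKEN here: updates vae.weight in place,
#                                  # so p_loss.backward() hits the version error
p_loss.backward()                  # fix: backprop the SGM loss first...
vae_opt.step()                     # ...then step both optimizers
sgm_opt.step()

# The other fix mentioned above, p_loss = sgm(eps.detach()).mean(), also
# avoids the error but blocks SGM gradients from training the VAE.
```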

wangherr commented 1 year ago

> Have you solved this problem yet? I think the reason for this error is that the VAE's optimizer performs a "step" before the SGM loss is backpropagated, which modifies weights in place that the SGM loss's backward pass still needs. The solution is to detach the gradient of the eps, or to move the VAE optimizer's "step" to after the backpropagation of the SGM loss. I am not sure how the authors ran the code; it may be related to the PyTorch version.

I forget how I solved it. Maybe the environment is the key. For diffusion models and latent diffusion models, there are easier codebases you could try.

SeunghyunKim1995 commented 8 months ago

I have the same issue. Do you remember how you solved it?

my env:

Python 3.9.12
torch 1.8.0+cu111
torchvision 0.9.0+cu111