autonomousvision / stylegan-xl

[SIGGRAPH'22] StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets

RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior. #96

Closed DQSSSSS closed 1 year ago

DQSSSSS commented 1 year ago

When I trained this model up to 1000k images, I received this error message:

tick 233   kimg 999.7    time 13m 53s      sec/tick 15.3    sec/kimg 3.74    maintenance 0.3    cpumem 5.54   gpumem 9.91   reserved 36.77  augment 0.000
Traceback (most recent call last):
  File "train.py", line 343, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "train.py", line 328, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train.py", line 113, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/tmp/code/train.py", line 49, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "/tmp/code/training/training_loop.py", line 339, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg)
  File "/tmp/code/training/loss.py", line 131, in accumulate_gradients
    pl_grads = torch.autograd.grad(outputs=[(gen_img * pl_noise).sum()], inputs=[gen_ws], create_graph=True, only_inputs=True)[0]
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 234, in grad
    return Variable._execution_engine.run_backward(
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
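
For context, this RuntimeError is raised by `torch.autograd.grad` whenever one of the requested `inputs` never participated in computing `outputs`. A minimal standalone sketch (illustrative only, not code from this repo) that reproduces the message:

```python
import torch

# 'unused' never enters the computation of 'out', mirroring what happens to
# gen_ws when the path from gen_ws to gen_img is broken.
used = torch.randn(4, requires_grad=True)
unused = torch.randn(4, requires_grad=True)
out = (used * 2).sum()

try:
    torch.autograd.grad(outputs=[out], inputs=[used, unused], create_graph=True)
except RuntimeError as e:
    print(e)  # One of the differentiated Tensors appears to not have been used ...

# With allow_unused=True the call succeeds and returns None for the unused input.
grads = torch.autograd.grad(outputs=[out], inputs=[used, unused],
                            create_graph=True, allow_unused=True)
print(grads[1])  # None
```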

Here is my training command:

python train.py --outdir=/XXX/training-runs/results --cfg=stylegan3-t --data=/XXX/dataset_label_stylegan3.zip \
    --gpus=8 --batch=256 --batch-gpu=16 --snap=50 --kimg=1500 --syn_layers=4 \
    --cbase=32768 --cmax=512 \
    --metrics=fid50k_full \
    --cond=False --resume=/XXX/training-runs/results/00000-stylegan3-t-dataset_label_stylegan3-gpus8-batch256/network-snapshot.pkl

My environment:

OS: Linux (Docker image)
Python: 3.8
PyTorch: 1.10
CUDA: 11.4
GPU: NVIDIA A100

DQSSSSS commented 1 year ago

Solved it. I had deleted the Fourier translation code, so I should set allow_unused=True... Sorry...
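
For reference, a rough sketch of that change to the call shown in the traceback above (the `None` handling is an assumption, not the upstream code):

```python
# Silences the error, but the gradient for an unused input comes back as None
# and still has to be handled explicitly.
pl_grads = torch.autograd.grad(outputs=[(gen_img * pl_noise).sum()], inputs=[gen_ws],
                               create_graph=True, only_inputs=True, allow_unused=True)[0]
if pl_grads is None:  # gen_ws not reachable from gen_img
    pl_grads = torch.zeros_like(gen_ws)
```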

DQSSSSS commented 1 year ago

Setting allow_unused=True doesn't actually solve it. I redesigned the input layer code instead: I changed self.affine into fixed weight and bias buffers, which keeps gen_ws in the graph while contributing nothing to the gradient.
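
A rough sketch of that kind of input-layer change (the class name and zero init are illustrative assumptions, not the exact patch): the affine map stays in the forward pass, so gen_ws remains part of the autograd graph, but its weight and bias are registered as buffers instead of parameters, so they are frozen and add nothing to the gradient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BufferAffine(nn.Module):
    # Illustrative replacement for self.affine: buffers instead of nn.Parameters.
    def __init__(self, w_dim, out_dim):
        super().__init__()
        self.register_buffer('weight', torch.zeros([out_dim, w_dim]))
        self.register_buffer('bias', torch.zeros([out_dim]))

    def forward(self, w):
        # The output still depends on w, so torch.autograd.grad can reach gen_ws;
        # weight and bias receive no gradient and are never optimized.
        return F.linear(w, self.weight, self.bias)
```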