autonomousvision / stylegan-xl

[SIGGRAPH'22] StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets

RuntimeError: output is too large #83

Open BenjiKCF opened 2 years ago

BenjiKCF commented 2 years ago

I have pretrained the model on 64x64 images. Now I am on the super-resolution stage and I want to get 256x256 images.

python train.py --outdir=./training-runs/styleganxl_training_reduced_256 --cfg=stylegan3-t --data=./data/styleganxl_training_reduced256.zip \
  --gpus=2 --batch=24 --mirror=1 --snap 10 --batch-gpu 12 --kimg 10000 --syn_layers 10 --cond True --mirror True --cbase 16384 --cmax 256 --syn_layers 7 \
  --superres --up_factor 4 --head_layers 4 \
  --path_stem training-runs/styleganxl_training_reduced_64/00000-stylegan3-t-styleganxl_training_reduced64-gpus2-batch176/best_model.pkl

The run then fails with RuntimeError: output is too large:


Setting up augmentation...
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
/RP1/mydocker/Ben/mtr/stylegan_xl/torch_utils/ops/filtered_lrelu.py:225: RuntimeWarning: filtered_lrelu called with parameters that have no optimized CUDA kernel, using generic fallback
  warnings.warn("filtered_lrelu called with parameters that have no optimized CUDA kernel, using generic fallback", RuntimeWarning)
Initializing logs...
Training for 10000 kimg...

/RP1/mydocker/Ben/mtr/stylegan_xl/torch_utils/ops/filtered_lrelu.py:225: RuntimeWarning: filtered_lrelu called with parameters that have no optimized CUDA kernel, using generic fallback
  warnings.warn("filtered_lrelu called with parameters that have no optimized CUDA kernel, using generic fallback", RuntimeWarning)
Traceback (most recent call last):
  File "/RP1/mydocker/Ben/mtr/stylegan_xl/train.py", line 336, in <module>
    main()  # pylint: disable=no-value-for-parameter
  File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/RP1/mydocker/Ben/mtr/stylegan_xl/train.py", line 321, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "/RP1/mydocker/Ben/mtr/stylegan_xl/train.py", line 106, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/RP1/mydocker/Ben/mtr/stylegan_xl/train.py", line 49, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "/RP1/mydocker/Ben/mtr/stylegan_xl/training/training_loop.py", line 339, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg)
  File "/RP1/mydocker/Ben/mtr/stylegan_xl/training/loss.py", line 121, in accumulate_gradients
    loss_Gmain.backward()
  File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
  File "/opt/conda/envs/sgxl/lib/python3.9/site-packages/torch/autograd/function.py", line 87, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore[attr-defined]
  File "/RP1/mydocker/Ben/mtr/stylegan_xl/torch_utils/ops/filtered_lrelu.py", line 264, in backward
    dx = _filtered_lrelu_cuda(up=down, down=up, padding=pp, gain=gg, slope=slope, clamp=None, flip_filter=ff).apply(dy, fd, fu, None, si, sx, sy)
  File "/RP1/mydocker/Ben/mtr/stylegan_xl/torch_utils/ops/filtered_lrelu.py", line 228, in forward
    y = upfirdn2d.upfirdn2d(x=y, f=fu, up=up, padding=[px0, px1, py0, py1], gain=up**2, flip_filter=flip_filter) # Upsample.
  File "/RP1/mydocker/Ben/mtr/stylegan_xl/torch_utils/ops/upfirdn2d.py", line 161, in upfirdn2d
    return _upfirdn2d_cuda(up=up, down=down, padding=padding, flip_filter=flip_filter, gain=gain).apply(x, f)
  File "/RP1/mydocker/Ben/mtr/stylegan_xl/torch_utils/ops/upfirdn2d.py", line 245, in forward
    y = _plugin.upfirdn2d(y, f.unsqueeze(1), 1, upy, 1, downy, 0, 0, pady0, pady1, flip_filter, gain)
RuntimeError: output is too large

However, I can run the following 128x128 command without any error:

python train.py --outdir=./training-runs/styleganxl_training_reduced_128 --cfg=stylegan3-t --data=./data/styleganxl_training_reduced128.zip \
  --gpus=2 --batch=32 --mirror=1 --snap 10 --batch-gpu 16 --kimg 10000 --syn_layers 10 --cond True --mirror True --cbase 16384 --cmax 256 --syn_layers 7 \
  --superres --up_factor 2 --head_layers 4 \
  --path_stem training-runs/styleganxl_training_reduced_64/00000-stylegan3-t-styleganxl_training_reduced64-gpus2-batch176/best_model.pkl
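
For what it's worth, here is a rough sketch of why the 256x256 / --up_factor 4 run might trip the size check while the 128x128 / --up_factor 2 run stays under it. It assumes the compiled upfirdn2d op (where the traceback ends) rejects output tensors above 2**31 - 1 elements; that limit and the layer sizes below are placeholders, not values read from the actual model:

# Rough size check (a sketch, not the repo's actual code). Assumption: the
# compiled upfirdn2d CUDA op rejects output tensors with more than 2**31 - 1
# elements, which is what "output is too large" appears to indicate here.
INT_MAX = 2**31 - 1

def upsampled_numel(batch_gpu, channels, height, width, up):
    # The generic filtered_lrelu fallback goes through upfirdn2d, which
    # materializes a spatially upsampled tensor of roughly this many elements.
    return batch_gpu * channels * (height * up) * (width * up)

# Hypothetical layer sizes (channels and internal upsampling factor are
# placeholders):
print(upsampled_numel(batch_gpu=12, channels=256, height=256, width=256, up=4) > INT_MAX)  # True  -> too large
print(upsampled_numel(batch_gpu=16, channels=256, height=128, width=128, up=4) > INT_MAX)  # False -> fits

Since the element count scales with batch_gpu, channels, resolution, and the square of the upsampling factor, lowering --batch-gpu is the easiest way to get back under the limit for a fixed output resolution.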
nkumarrai commented 1 year ago

Hi, I am also hitting this same issue. Please share any clues if you have any. Thanks.

nkumarrai commented 1 year ago

Hi @BenjiKCF, I was running the training on a V100 32 GB node. I decreased the batch size to 1 and it no longer threw the error. I'd suggest trying the same.
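
For example, keeping the 256x256 command above but shrinking the per-GPU batch (the values below are only an illustration, not tested settings; as far as I know --batch needs to stay a multiple of --gpus times --batch-gpu):

python train.py --outdir=./training-runs/styleganxl_training_reduced_256 --cfg=stylegan3-t --data=./data/styleganxl_training_reduced256.zip \
  --gpus=2 --batch=24 --batch-gpu 1 --mirror=1 --snap 10 --kimg 10000 --syn_layers 7 --cond True --cbase 16384 --cmax 256 \
  --superres --up_factor 4 --head_layers 4 \
  --path_stem training-runs/styleganxl_training_reduced_64/00000-stylegan3-t-styleganxl_training_reduced64-gpus2-batch176/best_model.pkl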