Cannot train `stylegan3-r` with `512x512`

AmitMY commented 1 year ago

Describe the bug

I can train `stylegan3-r` with 256x256 resolution, or `stylegan3-t` with 512x512, but not `stylegan3-r` with 512x512.

To Reproduce

python train.py --outdir=/scratch/training-runs --cfg=stylegan3-r \
    --data=/scratch/datasets/frames512x512.zip \
    --gpus=4 --batch=16 --gamma=10 --aug=noaug

Logs

Training options:
{
  "G_kwargs": {
    "class_name": "training.networks_stylegan3.Generator",
    "z_dim": 512,
    "w_dim": 512,
    "mapping_kwargs": {
      "num_layers": 2
    },
    "channel_base": 65536,
    "channel_max": 1024,
    "magnitude_ema_beta": 0.9994456359721023,
    "conv_kernel": 1,
    "use_radial_filters": true
  },
  "D_kwargs": {
    "class_name": "training.networks_stylegan2.Discriminator",
    "block_kwargs": {
      "freeze_layers": 0
    },
    "mapping_kwargs": {},
    "epilogue_kwargs": {
      "mbstd_group_size": 4
    },
    "channel_base": 32768,
    "channel_max": 512
  },
  "G_opt_kwargs": {
    "class_name": "torch.optim.Adam",
    "betas": [
      0,
      0.99
    ],
    "eps": 1e-08,
    "lr": 0.0025
  },
  "D_opt_kwargs": {
    "class_name": "torch.optim.Adam",
    "betas": [
      0,
      0.99
    ],
    "eps": 1e-08,
    "lr": 0.002
  },
  "loss_kwargs": {
    "class_name": "training.loss.StyleGAN2Loss",
    "r1_gamma": 10.0,
    "blur_init_sigma": 0,
    "blur_fade_kimg": 100.0
  },
  "data_loader_kwargs": {
    "pin_memory": true,
    "prefetch_factor": 2,
    "num_workers": 3
  },
  "training_set_kwargs": {
    "class_name": "training.dataset.ImageFolderDataset",
    "path": "/scratch/datasets/sign-language-512x512.zip",
    "use_labels": false,
    "max_size": 32704,
    "xflip": false,
    "resolution": 512,
    "random_seed": 0
  },
  "num_gpus": 4,
  "batch_size": 16,
  "batch_gpu": 4,
  "metrics": [
    "fid50k_full"
  ],
  "total_kimg": 25000,
  "kimg_per_tick": 4,
  "image_snapshot_ticks": 50,
  "network_snapshot_ticks": 50,
  "random_seed": 0,
  "ema_kimg": 5.0,
  "resume_pkl": "https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-afhqv2-512x512.pkl",
  "ada_kimg": 100,
  "ema_rampup": null,
  "run_dir": "/scratch/training-runs/00029-stylegan3-r-sign-language-512x512-gpus4-batch16-gamma10"
}

Output directory:    /scratch/training-runs/00029-stylegan3-r-sign-language-512x512-gpus4-batch16-gamma10
Number of GPUs:      4
Batch size:          16 images
Training duration:   25000 kimg
Dataset path:        /scratch/datasets/sign-language-512x512.zip
Dataset size:        32704 images
Dataset resolution:  512
Dataset labels:      False
Dataset x-flips:     False
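
The per-GPU batch of 4 in the options above follows from the global batch size and the GPU count. A minimal sketch of that arithmetic, assuming the per-GPU batch defaults to an even split of the global batch when it is not set explicitly (the function name `per_gpu_batch` is hypothetical and not part of the repository):

```python
def per_gpu_batch(batch_size: int, num_gpus: int) -> int:
    """Split a global batch evenly across GPUs (assumed default behaviour)."""
    if batch_size % num_gpus != 0:
        raise ValueError("batch size must be a multiple of the GPU count")
    return batch_size // num_gpus


# Values taken from the training options in this log.
batch_gpu = per_gpu_batch(batch_size=16, num_gpus=4)
print(batch_gpu)  # 4, matching "batch_gpu": 4
```

This is also why every output shape in the network summaries starts with 4.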

Creating output directory...
Launching processes...
Loading training set...

Num images:  32704
Image shape: [3, 512, 512]
Label shape: [0]

Constructing networks...
Resuming from "https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/stylegan3-r-afhqv2-512x512.pkl"
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "filtered_lrelu_plugin"... Done.

Generator                     Parameters  Buffers  Output shape         Datatype
---                           ---         ---      ---                  ---     
mapping.fc0                   262656      -        [4, 512]             float32 
mapping.fc1                   262656      -        [4, 512]             float32 
mapping                       -           512      [4, 16, 512]         float32 
synthesis.input.affine        2052        -        [4, 4]               float32 
synthesis.input               1048576     3081     [4, 1024, 36, 36]    float32 
synthesis.L0_36_1024.affine   525312      -        [4, 1024]            float32 
synthesis.L0_36_1024          1049600     157      [4, 1024, 36, 36]    float32 
synthesis.L1_36_1024.affine   525312      -        [4, 1024]            float32 
synthesis.L1_36_1024          1049600     157      [4, 1024, 36, 36]    float32 
synthesis.L2_52_1024.affine   525312      -        [4, 1024]            float32 
synthesis.L2_52_1024          1049600     169      [4, 1024, 52, 52]    float32 
synthesis.L3_52_1024.affine   525312      -        [4, 1024]            float32 
synthesis.L3_52_1024          1049600     157      [4, 1024, 52, 52]    float32 
synthesis.L4_84_1024.affine   525312      -        [4, 1024]            float32 
synthesis.L4_84_1024          1049600     169      [4, 1024, 84, 84]    float16 
synthesis.L5_84_1024.affine   525312      -        [4, 1024]            float32 
synthesis.L5_84_1024          1049600     157      [4, 1024, 84, 84]    float16 
synthesis.L6_148_1024.affine  525312      -        [4, 1024]            float32 
synthesis.L6_148_1024         1049600     169      [4, 1024, 148, 148]  float16 
synthesis.L7_148_967.affine   525312      -        [4, 1024]            float32 
synthesis.L7_148_967          991175      157      [4, 967, 148, 148]   float16 
synthesis.L8_276_645.affine   496071      -        [4, 967]             float32 
synthesis.L8_276_645          624360      169      [4, 645, 276, 276]   float16 
synthesis.L9_276_431.affine   330885      -        [4, 645]             float32 
synthesis.L9_276_431          278426      157      [4, 431, 276, 276]   float16 
synthesis.L10_532_287.affine  221103      -        [4, 431]             float32 
synthesis.L10_532_287         123984      169      [4, 287, 532, 532]   float16 
synthesis.L11_532_192.affine  147231      -        [4, 287]             float32 
synthesis.L11_532_192         55296       157      [4, 192, 532, 532]   float16 
synthesis.L12_532_128.affine  98496       -        [4, 192]             float32 
synthesis.L12_532_128         24704       25       [4, 128, 532, 532]   float16 
synthesis.L13_512_128.affine  65664       -        [4, 128]             float32 
synthesis.L13_512_128         16512       25       [4, 128, 512, 512]   float16 
synthesis.L14_512_3.affine    65664       -        [4, 128]             float32 
synthesis.L14_512_3           387         1        [4, 3, 512, 512]     float16 
synthesis                     -           -        [4, 3, 512, 512]     float32 
---                           ---         ---      ---                  ---     
Total                         16665594    5588     -                    -       
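
For a sense of scale, the output tensors in the generator table can be converted to bytes. The sketch below does this for a few of the larger synthesis layers, with shapes and dtypes copied from the table. It is only a rough lower bound on per-GPU activation memory: it ignores gradients, optimizer state, and any temporary buffers inside the custom ops. The helper `tensor_bytes` is hypothetical.

```python
from math import prod

BYTES_PER_ELEMENT = {"float16": 2, "float32": 4}


def tensor_bytes(shape, dtype):
    """Size in bytes of a dense tensor with the given shape and dtype."""
    return prod(shape) * BYTES_PER_ELEMENT[dtype]


# (layer name, output shape, dtype), copied from the generator summary.
layers = [
    ("synthesis.L8_276_645", (4, 645, 276, 276), "float16"),
    ("synthesis.L9_276_431", (4, 431, 276, 276), "float16"),
    ("synthesis.L10_532_287", (4, 287, 532, 532), "float16"),
    ("synthesis.L11_532_192", (4, 192, 532, 532), "float16"),
    ("synthesis.L12_532_128", (4, 128, 532, 532), "float16"),
]

for name, shape, dtype in layers:
    print(f"{name:24s} {tensor_bytes(shape, dtype) / 2**20:8.1f} MiB")

total = sum(tensor_bytes(shape, dtype) for _, shape, dtype in layers)
print(f"subtotal: {total / 2**30:.2f} GiB")
```

Each entry is a single forward output for a batch of 4, and backpropagation keeps such tensors alive. Memory pressure is therefore one plausible factor to rule out, although the log itself does not confirm it.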

Setting up PyTorch plugin "upfirdn2d_plugin"... Done.

Discriminator  Parameters  Buffers  Output shape        Datatype
---            ---         ---      ---                 ---     
b512.fromrgb   256         16       [4, 64, 512, 512]   float16 
b512.skip      8192        16       [4, 128, 256, 256]  float16 
b512.conv0     36928       16       [4, 64, 512, 512]   float16 
b512.conv1     73856       16       [4, 128, 256, 256]  float16 
b512           -           16       [4, 128, 256, 256]  float16 
b256.skip      32768       16       [4, 256, 128, 128]  float16 
b256.conv0     147584      16       [4, 128, 256, 256]  float16 
b256.conv1     295168      16       [4, 256, 128, 128]  float16 
b256           -           16       [4, 256, 128, 128]  float16 
b128.skip      131072      16       [4, 512, 64, 64]    float16 
b128.conv0     590080      16       [4, 256, 128, 128]  float16 
b128.conv1     1180160     16       [4, 512, 64, 64]    float16 
b128           -           16       [4, 512, 64, 64]    float16 
b64.skip       262144      16       [4, 512, 32, 32]    float16 
b64.conv0      2359808     16       [4, 512, 64, 64]    float16 
b64.conv1      2359808     16       [4, 512, 32, 32]    float16 
b64            -           16       [4, 512, 32, 32]    float16 
b32.skip       262144      16       [4, 512, 16, 16]    float32 
b32.conv0      2359808     16       [4, 512, 32, 32]    float32 
b32.conv1      2359808     16       [4, 512, 16, 16]    float32 
b32            -           16       [4, 512, 16, 16]    float32 
b16.skip       262144      16       [4, 512, 8, 8]      float32 
b16.conv0      2359808     16       [4, 512, 16, 16]    float32 
b16.conv1      2359808     16       [4, 512, 8, 8]      float32 
b16            -           16       [4, 512, 8, 8]      float32 
b8.skip        262144      16       [4, 512, 4, 4]      float32 
b8.conv0       2359808     16       [4, 512, 8, 8]      float32 
b8.conv1       2359808     16       [4, 512, 4, 4]      float32 
b8             -           16       [4, 512, 4, 4]      float32 
b4.mbstd       -           -        [4, 513, 4, 4]      float32 
b4.conv        2364416     16       [4, 512, 4, 4]      float32 
b4.fc          4194816     -        [4, 512]            float32 
b4.out         513         -        [4, 1]              float32 
---            ---         ---      ---                 ---     
Total          28982849    480      -                   -       

Setting up augmentation...
Distributing across 4 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...

Traceback (most recent call last):
  File "train.py", line 286, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "train.py", line 281, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train.py", line 98, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/workspace/stylegan3/train.py", line 47, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "/workspace/stylegan3/training/training_loop.py", line 278, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg)
  File "/workspace/stylegan3/training/loss.py", line 111, in accumulate_gradients
    loss_Dgen.mean().mul(gain).backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 264, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 153, in backward
    Variable._execution_engine.run_backward(
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 87, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore[attr-defined]
  File "/workspace/stylegan3/torch_utils/ops/conv2d_gradfix.py", line 149, in backward
    grad_weight = Conv2dGradWeight.apply(grad_output, input)
  File "/workspace/stylegan3/torch_utils/ops/conv2d_gradfix.py", line 178, in forward
    return torch._C._jit_get_operation(name)(weight_shape, grad_output, input, padding, stride, dilation, groups, *flags)
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = False
data = torch.randn([4, 513, 4, 4], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(513, 512, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams 
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = false
input: TensorDescriptor 0x7fa4ac0521d0
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 4, 513, 4, 4, 
    strideA = 8208, 16, 4, 1, 
output: TensorDescriptor 0x7fa4ac02b020
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 4, 512, 4, 4, 
    strideA = 8192, 16, 4, 1, 
weight: FilterDescriptor 0x7fa4ac0707c0
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 512, 513, 3, 3, 
Pointer addresses: 
    input: 0x7fa381da2000
    output: 0x7fa381dc2200
    weight: 0x7fa5c4000000
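
The failing operation in the dump above can be matched against the discriminator summary. The sketch below, using hypothetical helper names, checks that a 3x3 convolution from 513 to 512 channels has the parameter count listed for `b4.conv`, and that the reported strides are those of contiguous NCHW tensors.

```python
def conv2d_param_count(in_ch, out_ch, kernel, bias=True):
    """Weights plus optional bias of a square 2D convolution (groups=1)."""
    return out_ch * in_ch * kernel * kernel + (out_ch if bias else 0)


def contiguous_strides(shape):
    """Element strides of a densely packed row-major tensor."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


# Shapes copied from the cuDNN dump.
input_shape = (4, 513, 4, 4)
output_shape = (4, 512, 4, 4)

print(conv2d_param_count(513, 512, 3))   # 2364416, the b4.conv row
print(contiguous_strides(input_shape))   # (8208, 16, 4, 1)
print(contiguous_strides(output_shape))  # (8192, 16, 4, 1)
```

So the error surfaces in the weight-gradient pass of the discriminator's final 4x4 convolution, whose input is the `b4.mbstd` output. These tensors are small (the input holds 4 * 513 * 4 * 4 = 32,832 floats), which suggests, without proving it, that the size of this particular operation is not the cause.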

Additional context

Closely related unresolved issue: https://github.com/NVlabs/stylegan3/issues/78

jasuriy commented 2 months ago

@AmitMY Did you fix the issue? Were you able to train the model with your own dataset?

AmitMY commented 2 months ago

I only managed to train at 256x256, so that is what I ended up doing (https://github.com/sign-language-processing/pose-to-video/tree/main/pose_to_video/unconditional/stylegan3#training).

jasuriy commented 2 months ago

Hi, thank you for your reply.

I want to train at 1024x1024 to get higher-resolution image outputs. Do you think that is possible? If so, could you please share your source code? It would be highly appreciated.

Sincerely,

Jasurbek


NVlabs / stylegan3

Cannot train `stylegan3-r` with `512x512` #229