Stuck at "Training for 1000 kimg..."

davebobobo commented 2 years ago

Hey guys,

Complete beginner here. Thanks to some tutorials I made it this far: I created a dataset (200 pictures) and am trying to train a new network. Training then gets stuck without ever reaching tick 0. fakes_init.png is created properly. I am running on 2x RTX 3080 10 GB (neither of which shows any load).

(environment) PS C:\Users\david\stylegan3> python train.py --outdir E:\LL\Training --data E:\LL\Dest\Dest.zip --cfg=stylegan3-t --gpus=2 --batch=32 --gamma=8 --batch-gpu=8 --snap=20 --kimg=1000

Training options:
{
  "G_kwargs": {
    "class_name": "training.networks_stylegan3.Generator",
    "z_dim": 512,
    "w_dim": 512,
    "mapping_kwargs": {
      "num_layers": 2
    },
    "channel_base": 32768,
    "channel_max": 512,
    "magnitude_ema_beta": 0.9988915792636801
  },
  "D_kwargs": {
    "class_name": "training.networks_stylegan2.Discriminator",
    "block_kwargs": {
      "freeze_layers": 0
    },
    "mapping_kwargs": {},
    "epilogue_kwargs": {
      "mbstd_group_size": 4
    },
    "channel_base": 32768,
    "channel_max": 512
  },
  "G_opt_kwargs": {
    "class_name": "torch.optim.Adam",
    "betas": [
      0,
      0.99
    ],
    "eps": 1e-08,
    "lr": 0.0025
  },
  "D_opt_kwargs": {
    "class_name": "torch.optim.Adam",
    "betas": [
      0,
      0.99
    ],
    "eps": 1e-08,
    "lr": 0.002
  },
  "loss_kwargs": {
    "class_name": "training.loss.StyleGAN2Loss",
    "r1_gamma": 8.0
  },
  "data_loader_kwargs": {
    "pin_memory": true,
    "prefetch_factor": 2,
    "num_workers": 3
  },
  "training_set_kwargs": {
    "class_name": "training.dataset.ImageFolderDataset",
    "path": "E:\\LL\\Dest\\Dest.zip",
    "use_labels": false,
    "max_size": 216,
    "xflip": false,
    "resolution": 512,
    "random_seed": 0
  },
  "num_gpus": 2,
  "batch_size": 32,
  "batch_gpu": 8,
  "metrics": [
    "fid50k_full"
  ],
  "total_kimg": 1000,
  "kimg_per_tick": 4,
  "image_snapshot_ticks": 20,
  "network_snapshot_ticks": 20,
  "random_seed": 0,
  "ema_kimg": 10.0,
  "augment_kwargs": {
    "class_name": "training.augment.AugmentPipe",
    "xflip": 1,
    "rotate90": 1,
    "xint": 1,
    "scale": 1,
    "rotate": 1,
    "aniso": 1,
    "xfrac": 1,
    "brightness": 1,
    "contrast": 1,
    "lumaflip": 1,
    "hue": 1,
    "saturation": 1
  },
  "ada_target": 0.6,
  "run_dir": "E:\\LL\\Training\\00011-stylegan3-t-Dest-gpus2-batch32-gamma8"
}

Output directory:    E:\LL\Training\00011-stylegan3-t-Dest-gpus2-batch32-gamma8
Number of GPUs:      2
Batch size:          32 images
Training duration:   1000 kimg
Dataset path:        E:\LL\Dest\Dest.zip
Dataset size:        216 images
Dataset resolution:  512
Dataset labels:      False
Dataset x-flips:     False

Creating output directory...
Launching processes...
Loading training set...

Num images:  216
Image shape: [3, 512, 512]
Label shape: [0]

Constructing networks...
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "filtered_lrelu_plugin"... Done.

Generator                     Parameters  Buffers  Output shape        Datatype
---                           ---         ---      ---                 ---
mapping.fc0                   262656      -        [8, 512]            float32
mapping.fc1                   262656      -        [8, 512]            float32
mapping                       -           512      [8, 16, 512]        float32
synthesis.input.affine        2052        -        [8, 4]              float32
synthesis.input               262144      1545     [8, 512, 36, 36]    float32
synthesis.L0_36_512.affine    262656      -        [8, 512]            float32
synthesis.L0_36_512           2359808     25       [8, 512, 36, 36]    float32
synthesis.L1_36_512.affine    262656      -        [8, 512]            float32
synthesis.L1_36_512           2359808     25       [8, 512, 36, 36]    float32
synthesis.L2_52_512.affine    262656      -        [8, 512]            float32
synthesis.L2_52_512           2359808     37       [8, 512, 52, 52]    float32
synthesis.L3_52_512.affine    262656      -        [8, 512]            float32
synthesis.L3_52_512           2359808     25       [8, 512, 52, 52]    float32
synthesis.L4_84_512.affine    262656      -        [8, 512]            float32
synthesis.L4_84_512           2359808     37       [8, 512, 84, 84]    float16
synthesis.L5_84_512.affine    262656      -        [8, 512]            float32
synthesis.L5_84_512           2359808     25       [8, 512, 84, 84]    float16
synthesis.L6_148_512.affine   262656      -        [8, 512]            float32
synthesis.L6_148_512          2359808     37       [8, 512, 148, 148]  float16
synthesis.L7_148_483.affine   262656      -        [8, 512]            float32
synthesis.L7_148_483          2226147     25       [8, 483, 148, 148]  float16
synthesis.L8_276_323.affine   247779      -        [8, 483]            float32
synthesis.L8_276_323          1404404     37       [8, 323, 276, 276]  float16
synthesis.L9_276_215.affine   165699      -        [8, 323]            float32
synthesis.L9_276_215          625220      25       [8, 215, 276, 276]  float16
synthesis.L10_532_144.affine  110295      -        [8, 215]            float32
synthesis.L10_532_144         278784      37       [8, 144, 532, 532]  float16
synthesis.L11_532_96.affine   73872       -        [8, 144]            float32
synthesis.L11_532_96          124512      25       [8, 96, 532, 532]   float16
synthesis.L12_532_64.affine   49248       -        [8, 96]             float32
synthesis.L12_532_64          55360       25       [8, 64, 532, 532]   float16
synthesis.L13_512_64.affine   32832       -        [8, 64]             float32
synthesis.L13_512_64          36928       25       [8, 64, 512, 512]   float16
synthesis.L14_512_3.affine    32832       -        [8, 64]             float32
synthesis.L14_512_3           195         1        [8, 3, 512, 512]    float16
synthesis                     -           -        [8, 3, 512, 512]    float32
---                           ---         ---      ---                 ---
Total                         24873519    2468     -                   -

Setting up PyTorch plugin "upfirdn2d_plugin"... Done.

Discriminator  Parameters  Buffers  Output shape        Datatype
---            ---         ---      ---                 ---
b512.fromrgb   256         16       [8, 64, 512, 512]   float16
b512.skip      8192        16       [8, 128, 256, 256]  float16
b512.conv0     36928       16       [8, 64, 512, 512]   float16
b512.conv1     73856       16       [8, 128, 256, 256]  float16
b512           -           16       [8, 128, 256, 256]  float16
b256.skip      32768       16       [8, 256, 128, 128]  float16
b256.conv0     147584      16       [8, 128, 256, 256]  float16
b256.conv1     295168      16       [8, 256, 128, 128]  float16
b256           -           16       [8, 256, 128, 128]  float16
b128.skip      131072      16       [8, 512, 64, 64]    float16
b128.conv0     590080      16       [8, 256, 128, 128]  float16
b128.conv1     1180160     16       [8, 512, 64, 64]    float16
b128           -           16       [8, 512, 64, 64]    float16
b64.skip       262144      16       [8, 512, 32, 32]    float16
b64.conv0      2359808     16       [8, 512, 64, 64]    float16
b64.conv1      2359808     16       [8, 512, 32, 32]    float16
b64            -           16       [8, 512, 32, 32]    float16
b32.skip       262144      16       [8, 512, 16, 16]    float32
b32.conv0      2359808     16       [8, 512, 32, 32]    float32
b32.conv1      2359808     16       [8, 512, 16, 16]    float32
b32            -           16       [8, 512, 16, 16]    float32
b16.skip       262144      16       [8, 512, 8, 8]      float32
b16.conv0      2359808     16       [8, 512, 16, 16]    float32
b16.conv1      2359808     16       [8, 512, 8, 8]      float32
b16            -           16       [8, 512, 8, 8]      float32
b8.skip        262144      16       [8, 512, 4, 4]      float32
b8.conv0       2359808     16       [8, 512, 8, 8]      float32
b8.conv1       2359808     16       [8, 512, 4, 4]      float32
b8             -           16       [8, 512, 4, 4]      float32
b4.mbstd       -           -        [8, 513, 4, 4]      float32
b4.conv        2364416     16       [8, 512, 4, 4]      float32
b4.fc          4194816     -        [8, 512]            float32
b4.out         513         -        [8, 1]              float32
---            ---         ---      ---                 ---
Total          28982849    480      -                   -

Setting up augmentation...
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 1000 kimg...

What I tried: changing the batch size. It didn't help.
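For reference, the three batch-related options in the command above are tied together. The following is a minimal sketch, assuming (as the option checks in train.py appear to require) that --batch must be a multiple of --gpus times --batch-gpu, with the quotient being the number of gradient accumulation rounds per optimizer step. The helper name accumulation_rounds is hypothetical and not part of StyleGAN3.

```python
def accumulation_rounds(batch: int, gpus: int, batch_gpu: int) -> int:
    """Return how many per-GPU mini-batches are accumulated per optimizer step.

    Assumption for this sketch: the total batch is split evenly across GPUs,
    and each GPU processes its share in chunks of `batch_gpu` images.
    """
    if batch % (gpus * batch_gpu) != 0:
        raise ValueError("batch must be a multiple of gpus * batch_gpu")
    return batch // (gpus * batch_gpu)


# Values from the command in this post: --batch=32 --gpus=2 --batch-gpu=8
rounds = accumulation_rounds(batch=32, gpus=2, batch_gpu=8)
print(rounds)  # 2
```

Under this reading, changing --batch alone only changes the number of accumulation rounds; the per-GPU memory footprint is set by --batch-gpu.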

Desktop:
OS: Windows 10
PyTorch: 1.9.1+cu111
CUDA toolkit version: 11.1
NVIDIA driver version: 472.47
GPU: 2x RTX 3080 (10 GB)
Docker: No

Any hints on what I could try? Thanks

EDIT: After 30 minutes it gave me this error message:

Traceback (most recent call last):
  File "C:\Users\david\stylegan3\train.py", line 286, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "C:\Users\david\anaconda3\envs\environment\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\david\anaconda3\envs\environment\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\Users\david\anaconda3\envs\environment\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\david\anaconda3\envs\environment\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\david\stylegan3\train.py", line 281, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "C:\Users\david\stylegan3\train.py", line 98, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "C:\Users\david\anaconda3\envs\environment\lib\site-packages\torch\multiprocessing\spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "C:\Users\david\anaconda3\envs\environment\lib\site-packages\torch\multiprocessing\spawn.py", line 188, in start_processes
    while not context.join():
  File "C:\Users\david\anaconda3\envs\environment\lib\site-packages\torch\multiprocessing\spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "C:\Users\david\anaconda3\envs\environment\lib\site-packages\torch\multiprocessing\spawn.py", line 59, in _wrap
    fn(i, *args)
  File "C:\Users\david\stylegan3\train.py", line 47, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "C:\Users\david\stylegan3\training\training_loop.py", line 287, in training_loop
    torch.distributed.all_reduce(flat)
  File "C:\Users\david\anaconda3\envs\environment\lib\site-packages\torch\distributed\distributed_c10d.py", line 1176, in all_reduce
    work.wait()
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] Timed out waiting 1800000ms for recv operation to complete
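The 1800000 ms in the last line of the traceback is 30 minutes, which matches the delay reported in the EDIT and is the default timeout for torch.distributed collectives. The following is a minimal sketch, assuming PyTorch is installed, of a two-process gloo all_reduce check with a much shorter timeout, to see whether the processes can communicate at all outside of StyleGAN3. The names gloo_timeout_minutes, _worker and run_smoke_test, and the --run flag, are hypothetical; the file-based rendezvous is an assumption about how such a test might be set up on Windows.

```python
import datetime
import os
import re
import sys
import tempfile


def gloo_timeout_minutes(message: str) -> float:
    """Extract the 'Timed out waiting <N>ms' value from an error message, in minutes."""
    match = re.search(r"Timed out waiting (\d+)ms", message)
    if match is None:
        raise ValueError("no gloo timeout found in message")
    return int(match.group(1)) / 60_000


def _worker(rank: int, world_size: int, init_file: str) -> None:
    # Imported here so the helper above works without PyTorch installed.
    import torch
    import torch.distributed as dist

    # File-based rendezvous; the URL needs forward slashes on Windows.
    init_method = "file:///" + init_file.replace("\\", "/").lstrip("/")
    dist.init_process_group(
        backend="gloo",
        init_method=init_method,
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(seconds=60),  # fail fast instead of after 30 minutes
    )
    value = torch.ones(1) * (rank + 1)
    dist.all_reduce(value)  # default reduction is a sum across ranks
    print(f"rank {rank}: all_reduce result = {value.item()}")
    dist.destroy_process_group()


def run_smoke_test(world_size: int = 2) -> None:
    import torch.multiprocessing as mp

    with tempfile.TemporaryDirectory() as temp_dir:
        init_file = os.path.abspath(os.path.join(temp_dir, ".torch_distributed_init"))
        mp.spawn(_worker, args=(world_size, init_file), nprocs=world_size)


if __name__ == "__main__":
    print(gloo_timeout_minutes("Timed out waiting 1800000ms for recv operation to complete"))
    if "--run" in sys.argv:  # opt-in flag, hypothetical, used only by this sketch
        run_smoke_test()
```

Saved under any file name and run with --run, both ranks would be expected to report the same summed value if gloo communication works. If this small test also times out, the problem is more likely in the gloo setup on the machine than in the training configuration.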

LuoXubo commented 2 years ago

Same problem. Have you solved it?

Flofian commented 2 years ago

This definitely is not a "real" solution, but I had the same thing with stylegan3-t or stylegan3-r as the config, so I just went back to the stylegan2 config.

felkoh commented 2 years ago

I'm also having the same issue when setting gpus=2. With 1 GPU I can produce my first fake image at tick 0; however, it then gets stuck calculating metrics. Setting metrics=none doesn't fix this; it just stops at fake00000. Were you able to get yours working yet?

davebobobo commented 2 years ago

I only got it working by installing Ubuntu.

felkoh commented 2 years ago

Just noticed it isn't stuck on metrics; it just takes a very, very long time with a single GPU -.-
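As a rough illustration of why the metrics step can dominate: fid50k_full, the metric listed in the training options at the top of this thread, generates 50,000 images per evaluation. The following is a minimal sketch, assuming metrics are evaluated at tick 0, at every snap-th tick, and at the final tick, and that the run ends at total_kimg / kimg_per_tick ticks. The helper name snapshot_ticks is hypothetical, and the exact tick bookkeeping in the training loop may differ slightly.

```python
import math


def snapshot_ticks(total_kimg: int, kimg_per_tick: int, snap: int) -> list:
    """List the ticks at which a snapshot (and a metric evaluation) is assumed to happen."""
    last_tick = math.ceil(total_kimg / kimg_per_tick)
    ticks = [tick for tick in range(last_tick + 1) if tick % snap == 0]
    if ticks[-1] != last_tick:
        ticks.append(last_tick)  # assumed final snapshot when training finishes
    return ticks


# Values from the original post: --kimg=1000, kimg_per_tick=4, --snap=20
ticks = snapshot_ticks(total_kimg=1000, kimg_per_tick=4, snap=20)
print(len(ticks))           # 14 evaluations over the run
print(len(ticks) * 50_000)  # 700000 generated images just for FID
```

Under these assumptions the very first evaluation happens at tick 0, which would explain a long pause right after the first fake image is written.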

domef commented 2 years ago

I'm training the model in Colab and I'm having problems too. The training gets stuck after tick 0.

xiaomao19970819 commented 2 years ago

@domef Hey, did you figure out where it got stuck? It seems that training this model requires a lot of resources.

domef commented 2 years ago

@xiaomao19970819 I didn't train any further, but it was probably just very slow (I was training on Colab).

JuliusJacobsohn commented 2 years ago

I'm having the same issue with a single 3070 GPU (8 GB). The interesting part is that it did work for 120k iterations, but I had to shut down my PC and now I'm getting this error while trying to resume from the latest snapshot. I did change the snap from 5 to 20 (because the metrics report takes about 30 minutes and was happening way too often) and I also changed the number of workers from the default value (I think 2?) to 8, since I have a 12-core processor.

At the point where it says "Training for 25000 kimg...", my RAM fills up to the max (64 GB) and my PC becomes very unresponsive. I've let this run for over 30 minutes without anything happening. I also tried different batch sizes; that didn't change anything.

JuliusJacobsohn commented 2 years ago

Update: Resuming with my original settings (batch=4 and workers=default) resulted in no issues. My RAM is almost full though, at ~60/64 GB.

pinnintipraneethkumar commented 4 months ago

Hi @felkoh, can you say how long it was stuck, or whether there is any alternative solution?

Thank you

NVlabs/stylegan3, issue #70