NVlabs / stylegan3

Official PyTorch implementation of StyleGAN3

Hanging "Export Sample Images" step #110

Open dokluch opened 2 years ago

dokluch commented 2 years ago

Describe the bug
Nothing happens after the "Exporting sample images..." step: no error message, just no progress for over an hour.

To Reproduce
Steps to reproduce the behavior:

Start training with --gpus 8 (an approximate launch command is sketched below).

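For reference, the launch command was roughly the following, reconstructed from the training options and the run-directory name further down, so treat the exact flags as an approximation rather than the verbatim command:

python train.py --outdir=../results --cfg=stylegan3-t --data=../datasets/stylegan3-1024.zip \
  --gpus=8 --batch=32 --gamma=1 --mirror=1 --snap=50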

Training options (excerpt):
    "prefetch_factor": 2,
    "num_workers": 3
  },
  "training_set_kwargs": {
    "class_name": "training.dataset.ImageFolderDataset",
    "path": "../datasets/stylegan3-1024.zip",
    "use_labels": false,
    "max_size": 2537,
    "xflip": true,
    "resolution": 1024,
    "random_seed": 0
  },
  "num_gpus": 8,
  "batch_size": 32,
  "batch_gpu": 4,
  "metrics": [
    "fid50k_full"
  ],
  "total_kimg": 25000,
  "kimg_per_tick": 4,
  "image_snapshot_ticks": 50,
  "network_snapshot_ticks": 50,
  "random_seed": 0,
  "ema_kimg": 10.0,
  "augment_kwargs": {
    "class_name": "training.augment.AugmentPipe",
    "xflip": 1,
    "rotate90": 1,
    "xint": 1,
    "scale": 1,
    "rotate": 1,
    "aniso": 1,
    "xfrac": 1,
    "brightness": 1,
    "contrast": 1,
    "lumaflip": 1,
    "hue": 1,
    "saturation": 1
  },
  "ada_target": 0.6,
  "run_dir": "../results/00006-stylegan3-t-stylegan3-1024-gpus8-batch32-gamma1"
}

Output directory:    ../results/00006-stylegan3-t-stylegan3-1024-gpus8-batch32-gamma1
Number of GPUs:      8
Batch size:          32 images
Training duration:   25000 kimg
Dataset path:        ../datasets/stylegan3-1024.zip
Dataset size:        2537 images
Dataset resolution:  1024
Dataset labels:      False
Dataset x-flips:     True

Creating output directory...
Launching processes...
Loading training set...

Num images:  5074
Image shape: [3, 1024, 1024]
Label shape: [0]

Constructing networks...
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "filtered_lrelu_plugin"... Done.

Generator                     Parameters  Buffers  Output shape         Datatype
---                           ---         ---      ---                  ---
mapping.fc0                   262656      -        [4, 512]             float32
mapping.fc1                   262656      -        [4, 512]             float32
mapping                       -           512      [4, 16, 512]         float32
synthesis.input.affine        2052        -        [4, 4]               float32
synthesis.input               262144      1545     [4, 512, 36, 36]     float32
synthesis.L0_36_512.affine    262656      -        [4, 512]             float32
synthesis.L0_36_512           2359808     25       [4, 512, 36, 36]     float32
synthesis.L1_36_512.affine    262656      -        [4, 512]             float32
synthesis.L1_36_512           2359808     25       [4, 512, 36, 36]     float32
synthesis.L2_52_512.affine    262656      -        [4, 512]             float32
synthesis.L2_52_512           2359808     37       [4, 512, 52, 52]     float32
synthesis.L3_52_512.affine    262656      -        [4, 512]             float32
synthesis.L3_52_512           2359808     25       [4, 512, 52, 52]     float32
synthesis.L4_84_512.affine    262656      -        [4, 512]             float32
synthesis.L4_84_512           2359808     37       [4, 512, 84, 84]     float32
synthesis.L5_148_512.affine   262656      -        [4, 512]             float32
synthesis.L5_148_512          2359808     37       [4, 512, 148, 148]   float16
synthesis.L6_148_512.affine   262656      -        [4, 512]             float32
synthesis.L6_148_512          2359808     25       [4, 512, 148, 148]   float16
synthesis.L7_276_323.affine   262656      -        [4, 512]             float32
synthesis.L7_276_323          1488707     37       [4, 323, 276, 276]   float16
synthesis.L8_276_203.affine   165699      -        [4, 323]             float32
synthesis.L8_276_203          590324      25       [4, 203, 276, 276]   float16
synthesis.L9_532_128.affine   104139      -        [4, 203]             float32
synthesis.L9_532_128          233984      37       [4, 128, 532, 532]   float16
synthesis.L10_1044_81.affine  65664       -        [4, 128]             float32
synthesis.L10_1044_81         93393       37       [4, 81, 1044, 1044]  float16
synthesis.L11_1044_51.affine  41553       -        [4, 81]              float32
synthesis.L11_1044_51         37230       25       [4, 51, 1044, 1044]  float16
synthesis.L12_1044_32.affine  26163       -        [4, 51]              float32
synthesis.L12_1044_32         14720       25       [4, 32, 1044, 1044]  float16
synthesis.L13_1024_32.affine  16416       -        [4, 32]              float32
synthesis.L13_1024_32         9248        25       [4, 32, 1024, 1024]  float16
synthesis.L14_1024_3.affine   16416       -        [4, 32]              float32
synthesis.L14_1024_3          99          1        [4, 3, 1024, 1024]   float16
synthesis                     -           -        [4, 3, 1024, 1024]   float32
---                           ---         ---      ---                  ---
Total                         22313167    2480     -                    -

Setting up PyTorch plugin "upfirdn2d_plugin"... Done.

Discriminator  Parameters  Buffers  Output shape         Datatype
---            ---         ---      ---                  ---
b1024.fromrgb  128         16       [4, 32, 1024, 1024]  float16
b1024.skip     2048        16       [4, 64, 512, 512]    float16
b1024.conv0    9248        16       [4, 32, 1024, 1024]  float16
b1024.conv1    18496       16       [4, 64, 512, 512]    float16
b1024          -           16       [4, 64, 512, 512]    float16
b512.skip      8192        16       [4, 128, 256, 256]   float16
b512.conv0     36928       16       [4, 64, 512, 512]    float16
b512.conv1     73856       16       [4, 128, 256, 256]   float16
b512           -           16       [4, 128, 256, 256]   float16
b256.skip      32768       16       [4, 256, 128, 128]   float16
b256.conv0     147584      16       [4, 128, 256, 256]   float16
b256.conv1     295168      16       [4, 256, 128, 128]   float16
b256           -           16       [4, 256, 128, 128]   float16
b128.skip      131072      16       [4, 512, 64, 64]     float16
b128.conv0     590080      16       [4, 256, 128, 128]   float16
b128.conv1     1180160     16       [4, 512, 64, 64]     float16
b128           -           16       [4, 512, 64, 64]     float16
b64.skip       262144      16       [4, 512, 32, 32]     float32
b64.conv0      2359808     16       [4, 512, 64, 64]     float32
b64.conv1      2359808     16       [4, 512, 32, 32]     float32
b64            -           16       [4, 512, 32, 32]     float32
b32.skip       262144      16       [4, 512, 16, 16]     float32
b32.conv0      2359808     16       [4, 512, 32, 32]     float32
b32.conv1      2359808     16       [4, 512, 16, 16]     float32
b32            -           16       [4, 512, 16, 16]     float32
b16.skip       262144      16       [4, 512, 8, 8]       float32
b16.conv0      2359808     16       [4, 512, 16, 16]     float32
b16.conv1      2359808     16       [4, 512, 8, 8]       float32
b16            -           16       [4, 512, 8, 8]       float32
b8.skip        262144      16       [4, 512, 4, 4]       float32
b8.conv0       2359808     16       [4, 512, 8, 8]       float32
b8.conv1       2359808     16       [4, 512, 4, 4]       float32
b8             -           16       [4, 512, 4, 4]       float32
b4.mbstd       -           -        [4, 513, 4, 4]       float32
b4.conv        2364416     16       [4, 512, 4, 4]       float32
b4.fc          4194816     -        [4, 512]             float32
b4.out         513         -        [4, 1]               float32
---            ---         ---      ---                  ---
Total          29012513    544      -                    -

Setting up augmentation...
Distributing across 8 GPUs...
Setting up training phases...
Exporting sample images...

Expected behavior
Training proceeds past this step; the same configuration trains fine with 1 GPU.

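Since the hang only shows up with more than one GPU, it may be worth checking whether plain multi-GPU communication works on this machine before digging into the training code. Below is a minimal sanity-check sketch (not part of StyleGAN3; it assumes PyTorch built with NCCL support and all 8 GPUs visible) that runs a single all-reduce across every local GPU:

# nccl_check.py - standalone multi-GPU communication test (hypothetical helper, not part of StyleGAN3).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # Minimal single-node rendezvous; port is arbitrary but must be free.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # Each rank contributes (rank + 1); the all-reduce should return the sum over all ranks.
    x = torch.ones(1, device=f'cuda:{rank}') * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    expected = world_size * (world_size + 1) // 2
    print(f'rank {rank}: all_reduce result = {x.item()} (expected {expected})')
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()  # e.g. 8 in the run above
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)

If this also stalls, the problem is most likely in the NCCL / driver / GPU-topology setup rather than in train.py; if it completes quickly, the hang is more likely inside the training loop itself.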

IvanGarcia7 commented 2 years ago

Is there any solution for this? I have the same problem when trying to train StyleGAN3 with --gpus 2.

kobeshegu commented 1 year ago

Have you managed to fix this? I am running into the same problem...

fonzen1 commented 1 year ago

I have the same issue.

wenyi-li commented 5 months ago

I have the same issue. When I use only 1 GPU, everything runs well.
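The fact that everyone here only sees the hang with multiple GPUs is consistent with an inter-GPU communication problem rather than with the sample-image export itself. A quick check of peer-to-peer visibility between the GPUs (a small sketch, assuming a CUDA build of PyTorch; the script name is made up) looks like this:

# p2p_check.py - print the peer-to-peer access matrix for all visible GPUs (hypothetical helper, not part of StyleGAN3).
import torch

n = torch.cuda.device_count()
print(f'Visible GPUs: {n}')
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f'GPU {i} -> GPU {j}: peer access {"OK" if ok else "NOT available"}')

If peer access looks fine but collectives still stall, launching the training with NCCL_DEBUG=INFO set in the environment usually shows where NCCL gets stuck.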