NVlabs / stylegan2-ada

StyleGAN2 with adaptive discriminator augmentation (ADA) - Official TensorFlow implementation
https://arxiv.org/abs/2006.06676

Training starts from 0 after resuming, even though the log says it resumed. #30

Open orenong opened 4 years ago

orenong commented 4 years ago

Hey, I tried to resume training of the model I had trained: I specified the full path of the PKL after the --resume flag, and the log claimed that it loaded the PKL:


Constructing networks...
Setting up TensorFlow plugin "fused_bias_act.cu": Loading... Done.
Setting up TensorFlow plugin "upfirdn_2d.cu": Loading... Done.
Resuming from "./results/00000-FLAGS1-mirror-mirrory-11gb-gpu-bg-resumeffhq512/network-snapshot-000032.pkl"

G                             Params    OutputShape         WeightShape     
---                           ---       ---                 ---             
latents_in                    -         (?, 512)            -               
labels_in                     -         (?, 0)              -               
G_mapping/Normalize           -         (?, 512)            -               
G_mapping/Dense0              262656    (?, 512)            (512, 512)      
G_mapping/Dense1              262656    (?, 512)            (512, 512)      
G_mapping/Broadcast           -         (?, 16, 512)        -               
dlatent_avg                   -         (512,)              -               
Truncation/Lerp               -         (?, 16, 512)        -               
G_synthesis/4x4/Const         8192      (?, 512, 4, 4)      (1, 512, 4, 4)  
G_synthesis/4x4/Conv          2622465   (?, 512, 4, 4)      (3, 3, 512, 512)
G_synthesis/4x4/ToRGB         264195    (?, 3, 4, 4)        (1, 1, 512, 3)  
G_synthesis/8x8/Conv0_up      2622465   (?, 512, 8, 8)      (3, 3, 512, 512)
G_synthesis/8x8/Conv1         2622465   (?, 512, 8, 8)      (3, 3, 512, 512)
G_synthesis/8x8/Upsample      -         (?, 3, 8, 8)        -               
G_synthesis/8x8/ToRGB         264195    (?, 3, 8, 8)        (1, 1, 512, 3)  
G_synthesis/16x16/Conv0_up    2622465   (?, 512, 16, 16)    (3, 3, 512, 512)
G_synthesis/16x16/Conv1       2622465   (?, 512, 16, 16)    (3, 3, 512, 512)
G_synthesis/16x16/Upsample    -         (?, 3, 16, 16)      -               
G_synthesis/16x16/ToRGB       264195    (?, 3, 16, 16)      (1, 1, 512, 3)  
G_synthesis/32x32/Conv0_up    2622465   (?, 512, 32, 32)    (3, 3, 512, 512)
G_synthesis/32x32/Conv1       2622465   (?, 512, 32, 32)    (3, 3, 512, 512)
G_synthesis/32x32/Upsample    -         (?, 3, 32, 32)      -               
G_synthesis/32x32/ToRGB       264195    (?, 3, 32, 32)      (1, 1, 512, 3)  
G_synthesis/64x64/Conv0_up    2622465   (?, 512, 64, 64)    (3, 3, 512, 512)
G_synthesis/64x64/Conv1       2622465   (?, 512, 64, 64)    (3, 3, 512, 512)
G_synthesis/64x64/Upsample    -         (?, 3, 64, 64)      -               
G_synthesis/64x64/ToRGB       264195    (?, 3, 64, 64)      (1, 1, 512, 3)  
G_synthesis/128x128/Conv0_up  1442561   (?, 256, 128, 128)  (3, 3, 512, 256)
G_synthesis/128x128/Conv1     721409    (?, 256, 128, 128)  (3, 3, 256, 256)
G_synthesis/128x128/Upsample  -         (?, 3, 128, 128)    -               
G_synthesis/128x128/ToRGB     132099    (?, 3, 128, 128)    (1, 1, 256, 3)  
G_synthesis/256x256/Conv0_up  426369    (?, 128, 256, 256)  (3, 3, 256, 128)
G_synthesis/256x256/Conv1     213249    (?, 128, 256, 256)  (3, 3, 128, 128)
G_synthesis/256x256/Upsample  -         (?, 3, 256, 256)    -               
G_synthesis/256x256/ToRGB     66051     (?, 3, 256, 256)    (1, 1, 128, 3)  
G_synthesis/512x512/Conv0_up  139457    (?, 64, 512, 512)   (3, 3, 128, 64) 
G_synthesis/512x512/Conv1     69761     (?, 64, 512, 512)   (3, 3, 64, 64)  
G_synthesis/512x512/Upsample  -         (?, 3, 512, 512)    -               
G_synthesis/512x512/ToRGB     33027     (?, 3, 512, 512)    (1, 1, 64, 3)   
---                           ---       ---                 ---             
Total                         28700647                                      

D                    Params    OutputShape         WeightShape     
---                  ---       ---                 ---             
images_in            -         (?, 3, 512, 512)    -               
labels_in            -         (?, 0)              -               
512x512/FromRGB      256       (?, 64, 512, 512)   (1, 1, 3, 64)   
512x512/Conv0        36928     (?, 64, 512, 512)   (3, 3, 64, 64)  
512x512/Conv1_down   73856     (?, 128, 256, 256)  (3, 3, 64, 128) 
512x512/Skip         8192      (?, 128, 256, 256)  (1, 1, 64, 128) 
256x256/Conv0        147584    (?, 128, 256, 256)  (3, 3, 128, 128)
256x256/Conv1_down   295168    (?, 256, 128, 128)  (3, 3, 128, 256)
256x256/Skip         32768     (?, 256, 128, 128)  (1, 1, 128, 256)
128x128/Conv0        590080    (?, 256, 128, 128)  (3, 3, 256, 256)
128x128/Conv1_down   1180160   (?, 512, 64, 64)    (3, 3, 256, 512)
128x128/Skip         131072    (?, 512, 64, 64)    (1, 1, 256, 512)
64x64/Conv0          2359808   (?, 512, 64, 64)    (3, 3, 512, 512)
64x64/Conv1_down     2359808   (?, 512, 32, 32)    (3, 3, 512, 512)
64x64/Skip           262144    (?, 512, 32, 32)    (1, 1, 512, 512)
32x32/Conv0          2359808   (?, 512, 32, 32)    (3, 3, 512, 512)
32x32/Conv1_down     2359808   (?, 512, 16, 16)    (3, 3, 512, 512)
32x32/Skip           262144    (?, 512, 16, 16)    (1, 1, 512, 512)
16x16/Conv0          2359808   (?, 512, 16, 16)    (3, 3, 512, 512)
16x16/Conv1_down     2359808   (?, 512, 8, 8)      (3, 3, 512, 512)
16x16/Skip           262144    (?, 512, 8, 8)      (1, 1, 512, 512)
8x8/Conv0            2359808   (?, 512, 8, 8)      (3, 3, 512, 512)
8x8/Conv1_down       2359808   (?, 512, 4, 4)      (3, 3, 512, 512)
8x8/Skip             262144    (?, 512, 4, 4)      (1, 1, 512, 512)
4x4/MinibatchStddev  -         (?, 513, 4, 4)      -               
4x4/Conv             2364416   (?, 512, 4, 4)      (3, 3, 513, 512)
4x4/Dense0           4194816   (?, 512)            (8192, 512)     
Output               513       (?, 1)              (512, 1)        
---                  ---       ---                 ---             
Total                28982849                                      

Exporting sample images...
Replicating networks across 1 GPUs...
Initializing augmentations...
Setting up optimizers...
Constructing training graph...
Finalizing training ops...
Initializing metrics...
Training for 25000 kimg...

tick 0     kimg 0.0      time 1m 40s       sec/tick 28.1    sec/kimg 877.83  maintenance 71.9   gpumem 10.0  augment 0.000

But as you can see, it also says "kimg 0.0". This means the training started from 0, and I can confirm it really did start from 0 by looking at the output images.

I tried multiple PKL snapshots, but the problem persists.

My full arguments were:

!python train.py --outdir ./results --snap=2 --cfg=auto --data=./datasets/secret --augpipe="bg" --mirror=True --mirrory=True --metrics=None --resume="./results/00000-FLAGS1-mirror-mirrory-11gb-gpu-bg-resumeffhq512/network-snapshot-000032.pkl"

I ran the training on Google Colab. I'm looking for solutions. Thank you very much!

woctezuma commented 4 years ago

This is on purpose: the flag is intended for transfer learning, not for resuming training multiple times.
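
For example (a hedged illustration: the ffhq512 alias is resolved through the resume_specs table in train.py, so check your copy for the exact names available), transfer learning from the pre-trained FFHQ model looks like this:

python train.py --outdir=./results --data=./datasets/secret --resume=ffhq512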

yoo4471 commented 3 years ago

Has this problem been solved?

If I set 'network-snapshot-00008.pkl' as the resume path, training starts again from network-snapshot-000000.pkl.

I wonder if it's supposed to be like that, or if there's an option to specify the starting point.

Below is the resume command I use.

python {train_path} --aug=ada --target=0.7 --mirror=1 --snap=1 --gpus=1 --metrics=none --data={dataset_path} --outdir={outdir_path} --resume={latest_network_snapshot}

woctezuma commented 3 years ago

I wonder if it's supposed to be like that or if there's an option to designate.

Check my answer above. The training will happen from iteration 0 again, starting with your checkpoint as a base model.

https://github.com/NVlabs/stylegan2-ada/blob/43eca5156619dd6ee649c70c4bc3f3cab19a5b79/training/training_loop.py#L217-L223
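
In essence, those lines only load the network weights; nothing restores the training progress counter. A rough paraphrase (not the verbatim source; this runs inside training/training_loop.py, where dnnlib and pickle are already imported):

# Fresh networks are constructed, then the checkpoint's weights are copied in.
print(f'Resuming from "{resume_pkl}"')
with dnnlib.util.open_url(resume_pkl) as f:
    rG, rD, rGs = pickle.load(f)
G.copy_vars_from(rG); D.copy_vars_from(rD); Gs.copy_vars_from(rGs)

# Later in the loop, the progress counter still starts from scratch,
# which is why the log prints "kimg 0.0":
cur_nimg = 0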

yoo4471 commented 3 years ago

@woctezuma, thank you for your answer. To make sure I understand, may I ask this again?

In the results below, '2_training/network-snapshot-0000.pkl' is the result of training using '1_training/network-snapshot-0008.pkl' as the base model. If I keep resuming this way, is it right to expect much better results at 300_training/network-snapshot-0008.pkl?

The reason for asking is that training in Colab always ends at 0008.pkl. u_u

(First training and results)
1_training/network-snapshot-0000.pkl
1_training/network-snapshot-0004.pkl
1_training/network-snapshot-0008.pkl

(Second training and results, setting '1_training/network-snapshot-0008.pkl' as the resume path)
2_training/network-snapshot-0000.pkl
2_training/network-snapshot-0004.pkl
2_training/network-snapshot-0008.pkl

woctezuma commented 3 years ago

In this way, is it right that much better trained results are produced at 300_training/network-snapshot-0008.pkl?

You cannot be sure of that: you need to monitor the metrics of interest. You should not train blindly: tweaking the training parameters can matter a lot. Check the metrics to understand when and how you should tweak them.
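
For example, metrics for a given snapshot can be computed offline with the repo's calc_metrics.py. The flags below are from memory, so double-check them against python calc_metrics.py --help:

python calc_metrics.py --metrics=fid50k_full --data=./datasets/secret --network=./results/00000-FLAGS1-mirror-mirrory-11gb-gpu-bg-resumeffhq512/network-snapshot-000032.pkl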

Moreover, I would say no. I don't recommend using Colab to train from scratch with multiple resumes, especially since the innovative part of the paper is the ADA scheduling strategy for training. Each run is stopped very early, before the ADA strategy has had a chance to take effect, so you should not expect good results from repeatedly applying this kind of resume process: you will just be repeating the first stage of the training schedule, where there is not much strategy.

If you really want to use Colab (which I don't recommend), you could have a look at https://github.com/NVlabs/stylegan2-ada/pull/6.
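
The gist of such a patch is to start the progress counter from the checkpoint's kimg instead of 0 (e.g. cur_nimg = int(resume_kimg * 1000) in training/training_loop.py, following the resume_kimg convention of the original StyleGAN2 code; the actual PR may differ in the details). A small hypothetical helper to recover that value from a snapshot filename:

import re

def kimg_from_snapshot(path):
    # Hypothetical helper, not part of the repo: infer training progress
    # in kimg from a snapshot filename.
    # e.g. kimg_from_snapshot('network-snapshot-000032.pkl') -> 32.0
    m = re.search(r'network-snapshot-(\d+)\.pkl$', path)
    return float(m.group(1)) if m else 0.0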

yoo4471 commented 3 years ago

@woctezuma I really appreciate your answer :)

xyt000-xjj commented 1 year ago

How can I recover if training is interrupted?