Open orenong opened 4 years ago
It is on purpose. The flag is used for transfer learning, not for resuming multiple times.
Is the problem solved?
If I set 'network-snapshot-00008.pkl' as the resume path and restart training, the snapshots are generated from network-snapshot-000000.pkl again.
I wonder if it is supposed to work like that, or if there is an option to specify where the counter should pick up from.
Below is the resume command I use.
python {train_path} --aug=ada --target=0.7 --mirror=1 --snap=1 --gpus=1 --metrics=none --data={dataset_path} --outdir={outdir_path} --resume={latest_network_snapshot}
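(As an aside, the `{latest_network_snapshot}` placeholder can be filled in automatically. Below is a minimal sketch, assuming the usual output layout where each run writes its `network-snapshot-*.pkl` files into a numbered subfolder of `--outdir`; the helper is only an illustration, not part of the repository.)

```python
import glob
import os

# Minimal sketch (not repository code): locate the newest
# network-snapshot-*.pkl under the output directory so that its path
# can be passed to --resume in the next session.
def find_latest_snapshot(outdir):
    snapshots = glob.glob(os.path.join(outdir, "**", "network-snapshot-*.pkl"), recursive=True)
    return max(snapshots, key=os.path.getmtime) if snapshots else None

latest_network_snapshot = find_latest_snapshot("./results")
print(latest_network_snapshot)  # pass this value to --resume
```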
Check my answer above. The training will happen from iteration 0 again, starting with your checkpoint as a base model.
@woctezuma, thank you for your answer. To make sure I understand, may I ask a follow-up question?
In the listing below, '2_training/network-snapshot-0000.pkl' is the result of training with '1_training/network-snapshot-0008.pkl' as the base model. If I keep repeating this, is it right that a much better trained model would eventually be produced at 300_training/network-snapshot-0008.pkl?
The reason I am asking is that training on Colab always ends at 0008.pkl. u_u
(First training and result)
1_training/network-snapshot-0000.pkl
1_training/network-snapshot-0004.pkl
1_training/network-snapshot-0008.pkl

(Second training and result, resuming from '1_training/network-snapshot-0008.pkl')
2_training/network-snapshot-0000.pkl
2_training/network-snapshot-0004.pkl
2_training/network-snapshot-0008.pkl
You cannot be sure of that: you need to monitor the metrics of interest. You should not train blindly: tweaking the training parameters can matter a lot. Check the metrics to understand when and how you should tweak them.
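(For what it's worth, here is a minimal sketch of how the metrics could be collected across resumed runs, assuming metrics are enabled and each run directory contains a JSONL metric log with one JSON object per line; the file name and field names below are assumptions about that format, so adjust them to whatever your runs actually write.)

```python
import glob
import json

# Hedged sketch: gather metric entries from every run directory so the
# progress across resumed runs can be inspected in one place.  The file
# name (metric-fid50k_full.jsonl) and the fields read below are assumed,
# not guaranteed by this thread.
for path in sorted(glob.glob("./results/*/metric-fid50k_full.jsonl")):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                entry = json.loads(line)
                print(path, entry.get("snapshot_pkl"), entry.get("results"))
```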
Moreover, I would say no. I don't recommend using Colab to train from scratch with multiple resumes, especially since the innovative part of the paper is the ADA scheduling strategy used during training. Each of your runs is stopped very early, before the ADA schedule has had a chance to do much, so you should not expect good results from repeatedly applying this kind of resume: you will keep replaying the first stage of the training schedule, where there is not much strategy yet.
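(To make that concrete, here is a rough sketch of the ADA heuristic as described in the paper: the augmentation probability p starts at 0 and is nudged toward or away from the --target value by a small step after each update interval, so a run that dies after only a few kimg barely moves p at all. The names and constants below are illustrative, not the repository's actual code.)

```python
# Illustrative sketch of the ADA schedule from the paper (not repository code):
# p is adjusted after each interval depending on whether the overfitting
# heuristic r_t is above or below the target.
def update_augment_p(p, r_t, target=0.7, nimg_per_interval=4 * 64, ada_kimg=500):
    # A full sweep of p from 0 to 1 is spread over ada_kimg thousand images,
    # so each individual adjustment is a tiny step.
    step = nimg_per_interval / (ada_kimg * 1000)
    p += step if r_t > target else -step
    return min(max(p, 0.0), 1.0)

# Even if r_t stays above the target the whole time, p after 8 kimg is tiny:
p = 0.0
for _ in range(8000 // (4 * 64)):
    p = update_augment_p(p, r_t=1.0)
print(p)  # about 0.016 with these illustrative numbers
```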
If you really want to use Colab (which I don't recommend), you could have a look at https://github.com/NVlabs/stylegan2-ada/pull/6.
@woctezuma I really appreciate your answer :)
How to recover if training is interrupted
Hey, I tried to resume training from the model I had trained: I specified the full path of the PKL after the --resume flag, and the log claimed that it loaded the PKL.
But as you can see, it also says "kimg 0.0", which means the training started from 0, and I can see it really did start from 0 by looking at the output images.
I tried multiple PKL snapshots, but the problem persists.
My full arguments were:
!python train.py --outdir ./results --snap=2 --cfg=auto --data=./datasets/secret --augpipe="bg" --mirror=True --mirrory=True --metrics=None --resume="./results/00000-FLAGS1-mirror-mirrory-11gb-gpu-bg-resumeffhq512/network-snapshot-000032.pkl" --augpipe="bg"
And I ran the training on Google Colab. I'm looking for solutions. Thank you very much!
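(In case it helps, here is a minimal sketch of how one Colab session could pick up from the previous one, assuming the snapshots from earlier runs are still available under ./results; the wrapper simply rebuilds the command above with the newest snapshot and is only an illustration, not part of the repository.)

```python
import glob
import os
import subprocess

# Hedged sketch: find the newest snapshot written by any previous run and
# launch train.py again with it as the --resume target.  The flags mirror
# the command quoted above.
snapshots = sorted(glob.glob("./results/*/network-snapshot-*.pkl"), key=os.path.getmtime)
assert snapshots, "no previous snapshot found under ./results"

subprocess.run([
    "python", "train.py",
    "--outdir", "./results",
    "--snap=2", "--cfg=auto",
    "--data=./datasets/secret",
    "--augpipe=bg",
    "--mirror=True", "--mirrory=True",
    "--metrics=None",
    f"--resume={snapshots[-1]}",
], check=True)
```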