NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Cannot resume training without quality loss #30

Closed AndroYD84 closed 4 years ago

AndroYD84 commented 4 years ago

Due to unfortunate circumstances my training process terminated abruptly, and every attempt to resume it has produced a model that sounds much worse than before the interruption. Even after days of further training it doesn't get back to its previous quality, as if the run were unrecoverable. I tried two different methods:

1) I edit this line in train.py with the latest checkpoint number (i.e. "123456" if the checkpoint filename is "checkpoint_123456"). The tensorboard log doesn't seem to contain any information about the latest epoch, so I leave that at 0 (I assume it's purely cosmetic and only resets the epoch counter; results shouldn't be affected, right?). I make a backup and run python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start, just like when I began training. It does appear to resume from the iteration I pointed it to, but the results are far worse than they used to be, even after 3 days of training.

2) I revert train.py to its original state and start training from scratch, but warm-start from my latest model (mylatestmodel.pt) instead of the pretrained LibriTTS model (mellotron_libritts.pt) from this repo, so I run python train.py --output_directory=outdir --log_directory=logdir -c models/mylatestmodel.pt --warm_start. The generated results sound even worse than (1) after 2 days of training.

Is it actually possible to get an interrupted training run back on track? If so, what is the correct method? It is quite frustrating to lose days or hours of training to an incident beyond our control.

I also think a console logger would be a useful addition: if the terminal window is closed unexpectedly you would still have a record of the run. In my case it would have told me how many epochs were reached before training was interrupted, even if that number is purely cosmetic.
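On the console-logger point, here is a minimal sketch of what that could look like, assuming train.py reports progress with plain print() calls; the helper name and log path below are hypothetical and not part of the repo:

```python
# Hypothetical helper for train.py: mirror progress messages to a file so the
# epoch/iteration history survives a closed terminal or an abrupt shutdown.
import logging
import sys


def setup_console_logger(log_path="train_console.log"):
    logger = logging.getLogger("mellotron_train")
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.StreamHandler(sys.stdout))  # keep printing to the terminal
    logger.addHandler(logging.FileHandler(log_path))      # also persist every message to disk
    return logger


# Inside the training loop, instead of a bare print():
# logger.info("Epoch {} iteration {} loss {:.6f}".format(epoch, iteration, reduced_loss))
```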

rafaelvalle commented 4 years ago

By default a checkpoint is saved every 500 iterations.

AndroYD84 commented 4 years ago

Yes, but that's not the issue. Imagine I trained for weeks, up to 300,400 iterations, and a blackout happened: I'd lose only 400 iterations of progress and still have a "checkpoint_300000" file. Is it possible to resume training from that checkpoint? Every attempt I made to resume from a checkpoint has produced a model that sounds much worse than its predecessor (the "checkpoint_300000" file). I know resuming training sometimes needs some warming up before it returns to its previous state, but that isn't happening even after a week; the results are not even close to the predecessor. If I had a time machine and could have prevented the blackout, the new checkpoint (e.g. checkpoint_400000) would have sounded better, not worse, than before. Do I have to start over from scratch and lose weeks of training, or did I do something wrong? Thanks for your patience.

texpomru13 commented 4 years ago

@AndroYD84 try changing these hyperparameters before resuming from the checkpoint: ignore_layers=[] and use_saved_learning_rate=True
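For reference, a minimal sketch of that change, assuming the two fields live in hparams.py as in the Tacotron 2-style codebase this repo builds on; a plain dict stands in for the real HParams object so the snippet runs on its own:

```python
# Sketch of the hparams.py edit suggested above; only these two entries change,
# everything else keeps its existing value.
def create_resume_hparams():
    hparams = {
        # ... all other Mellotron hyperparameters stay as they are ...
        "ignore_layers": [],              # load every layer from the checkpoint, skip none
        "use_saved_learning_rate": True,  # continue with the learning rate stored in the checkpoint
    }
    return hparams


if __name__ == "__main__":
    print(create_resume_hparams())
```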

rafaelvalle commented 4 years ago

Note that --warm_start does not restore the optimizer state. When resuming from your own model, you should not include --warm_start.
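Put differently, the two load paths restore different amounts of state. The sketch below is paraphrased from a Tacotron 2-style train.py and is only an approximation of this repo's code: --warm_start restores the model weights alone (optionally dropping the layers in ignore_layers), while loading a checkpoint without --warm_start also restores the optimizer state, learning rate, and iteration counter, which is what lets training pick up where it left off.

```python
import torch


def warm_start_model(checkpoint_path, model, ignore_layers):
    # --warm_start path: weights only. Optimizer moments, learning rate and the
    # iteration counter all start from scratch, so quality dips after resuming.
    checkpoint_dict = torch.load(checkpoint_path, map_location="cpu")
    model_dict = checkpoint_dict["state_dict"]
    if len(ignore_layers) > 0:
        model_dict = {k: v for k, v in model_dict.items() if k not in ignore_layers}
        full_dict = model.state_dict()
        full_dict.update(model_dict)
        model_dict = full_dict
    model.load_state_dict(model_dict)
    return model


def load_checkpoint(checkpoint_path, model, optimizer):
    # Plain -c path: weights plus optimizer state, learning rate and iteration,
    # so training genuinely continues from where the checkpoint was saved.
    checkpoint_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(checkpoint_dict["state_dict"])
    optimizer.load_state_dict(checkpoint_dict["optimizer"])
    learning_rate = checkpoint_dict["learning_rate"]
    iteration = checkpoint_dict["iteration"]
    return model, optimizer, learning_rate, iteration
```

So, assuming checkpoints were written to outdir, resuming your own run would look like python train.py --output_directory=outdir --log_directory=logdir -c outdir/checkpoint_300000 (no --warm_start), pointing -c at your own latest checkpoint rather than the pretrained mellotron_libritts.pt.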

AndroYD84 commented 4 years ago

Thank you for your help. I resumed training without the "warm_start" option and I can confirm that so far I haven't noticed any quality loss. I haven't tried texpomru13's solution, as the results were already improving without changing anything. If at some point the model stops improving I plan to test that solution as well, but right now I don't want to jinx it.