I can confirm this. But this was already fixed by @WeberJulian in this PR (https://github.com/coqui-ai/TTS/commit/23d789c0722afe88f0abf3b679ee9199d877eb7a#diff-18ac0d5b5b29ace6dde3a4dc7a18ef3822992377781e2e32c058dc4270d7a1c9) on 20.12.2021. That change was then overwritten by @Edresson in this commit (https://github.com/coqui-ai/TTS/commit/85418ffeaa93cda22ac0be30855f55a33b64ce13#diff-18ac0d5b5b29ace6dde3a4dc7a18ef3822992377781e2e32c058dc4270d7a1c9), possibly due to a merge conflict.
In my case it worked by setting `parse_command_line_args` to `True` in `train_tts.py`.
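For reference, a minimal sketch of that workaround (the surrounding Trainer arguments here are illustrative, not the exact code in `train_tts.py`):

```python
# TTS/bin/train_tts.py (sketch): pass parse_command_line_args=True so the
# Trainer reads --continue_path / --restore_path from the command line.
trainer = Trainer(
    args,
    config,
    output_path,
    model=model,
    parse_command_line_args=True,  # was False; this re-enables CLI parsing
)
trainer.fit()
```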
It seems there is a similar error when training the vocoder: https://github.com/coqui-ai/TTS/blob/c63bb481e95bd4a1ff978947d8e3e6c0bfb4177f/TTS/bin/train_vocoder.py#L64
One more question. As I understand it, `continue_path` is for continuing after a crash (for example, a power failure), while `restore_path` is for starting a new training run from an existing model. When we use `continue_path`, the model is restored via the `restore_path` functionality (the `restore_model` function), and then this code: https://github.com/coqui-ai/TTS/blob/main/TTS/trainer.py#L449-L455 always drops the learning rate back to its starting value. To correct for this, here: https://github.com/coqui-ai/TTS/blob/main/TTS/trainer.py#L312-L317 `.last_epoch` is set to the last global step. But that is not correct and does not work: not all schedulers compute the LR from `last_epoch` (for example, `torch.optim.lr_scheduler.ExponentialLR` does not). Moreover, if `scheduler_after_epoch` is used, the LR does not depend directly on the current global step. It seems the right way is to save the scheduler state(s) into the checkpoint (via the `state_dict` method) and restore them in lines L312-L317 (via `load_state_dict`). Is that correct?
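For concreteness, a minimal sketch of that store/restore approach, assuming a plain checkpoint dict; the function and key names here are illustrative, not the Trainer's actual checkpoint format:

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, step):
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),  # keep the LR schedule state
            "step": step,
        },
        path,
    )

def restore_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    # Restoring the scheduler state (instead of setting .last_epoch) keeps
    # the LR correct for every scheduler type, including ExponentialLR.
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["step"]
```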
I second this; it is surprising to see the learning rate reset when I resume training. A temporary workaround is to remove or comment out L449-455, which resets the LR:
```python
# Trainer code (roughly trainer.py L449-455) that resets the LR on restore:
if isinstance(self.optimizer, list):
    for idx, optim in enumerate(self.optimizer):
        for group in optim.param_groups:
            group["lr"] = self.get_lr(model, config)[idx]
else:
    for group in self.optimizer.param_groups:
        group["lr"] = self.get_lr(model, config)
```
Then the new scheduler created in `Trainer.__init__()` will use the resumed LR.
I tried that first, but it does not work correctly because after restarting, the scheduler's state is as if it were at step 0. I think the right way is to store/restore the scheduler state. I have written this code (it is relatively simple) and can make a PR, but I don't understand which branch is current: main, dev, or something else?
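For illustration, a standalone PyTorch snippet (not Trainer code) showing why carrying the optimizer LR over is not enough: a recreated scheduler starts at step 0 and snaps the LR back to its initial value.

```python
import torch

model = torch.nn.Linear(2, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.9)

for _ in range(5):
    opt.step()
    sched.step()
print(sched.last_epoch, opt.param_groups[0]["lr"])  # 5, ~0.00059

# A freshly constructed scheduler, as happens on --continue_path,
# behaves as if training just started:
sched2 = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.9)
print(sched2.last_epoch, opt.param_groups[0]["lr"])  # 0, back to 0.001
```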
@r7sa They wrote somewhere that they don't accept PRs on main, only on dev.
The issue is fixed by https://github.com/coqui-ai/Trainer/commit/61232cdefc5977f0b847b55f64bb13d09e292d11. It should be patched in coqui-ai/TTS once the new trainer version is released and TTS upgrades to it.
I cannot find that part to comment out, @iamanigeeit! I am having the same issue where current_lr is reset to 0.00000 on every resume.
@erogol I am still seeing this issue: the LR resets when continuing training. I've cloned the latest version of coqui-tts.
🐛 Description
Attempting to pick up a previous/cancelled training session from the last checkpoint does not work as expected.
To Reproduce
Given a previous training run of tacotron2 found in `coqui_tts-December-19-2021_10+40PM-0000000/`, with the last checkpoint being `checkpoint_100000.pth.tar`, the following command should be expected to pick up from that last checkpoint and continue training. However, instead it begins a new training job within the directory specified by `--continue_path`, beginning at GLOBAL_STEP 0.

Output:
This is a known bug

I've mentioned this on the Coqui Matrix chat; @WeberJulian has said that this is a known bug and that `--continue_path` is not working as it should. The current fix is to modify `train_tts.py` to set `parse_command_line_args=True` when creating the Trainer (line 59).

Environment
Installation (conda, pip, source): conda