coqui-ai / TTS

πŸΈπŸ’¬ - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] --continue-path / resuming training from an existing job does not work #1039

Closed jreus closed 2 years ago

jreus commented 2 years ago

πŸ› Description

Attempting to pick up a previous/cancelled training session from the last checkpoint does not work as expected.

To Reproduce

Given a previous Tacotron2 training run in coqui_tts-December-19-2021_10+40PM-0000000/, with the last checkpoint being checkpoint_100000.pth.tar, the following command is expected to pick up from that checkpoint and continue training:

CUDA_VISIBLE_DEVICES=1 python ~/TTS/TTS/bin/train_tts.py --continue_path coqui_tts-December-19-2021_10+40PM-0000000/

However, it instead starts a new training job inside the directory specified by --continue_path, beginning at GLOBAL_STEP 0.

Output:

| > Found 13100 files in /its/home/jr586/datasets/tts/LJSpeech-1.1

Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > log_func:np.log
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:False
| > symmetric_norm:True
| > mel_fmin:0
| > mel_fmax:8000.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > stats_path:None
| > base:2.718281828459045
| > hop_length:256
| > win_length:1024
Using model: tacotron2
Using CUDA: True
Number of GPUs: 1

Model has 52676308 parameters

Number of output frames: 6

EPOCH: 0/3000
--> coqui_tts-December-19-2021_10+40PM-0000000/

DataLoader initialization
| > Use phonemes: True
| > phoneme language: en-us
| > Number of instances : 12969
| > Max length sequence: 188
| > Min length sequence: 13
| > Avg length sequence: 100.90014650319993
| > Num. instances discarded by max-min (max=150, min=1) seq limits: 747
| > Batch group size: 0.

TRAINING (2021-12-24 11:57:34)
/its/home/jr586/TTS/TTS/tts/models/tacotron2.py:268: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
alignment_lengths = (
/its/home/jr586/.conda/envs/ml/lib/python3.9/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
--> STEP: 0/381 -- GLOBAL_STEP: 0
| > decoder_loss: 34.07442 (34.07442)
| > postnet_loss: 36.28318 (36.28318)
| > stopnet_loss: 1.43249 (1.43249)
| > decoder_coarse_loss: 34.07126 (34.07126)
| > decoder_ddc_loss: 0.00982 (0.00982)
| > ga_loss: 0.02193 (0.02193)
| > decoder_diff_spec_loss: 0.39038 (0.39038)
| > postnet_diff_spec_loss: 4.94285 (4.94285)
| > decoder_ssim_loss: 0.64225 (0.64225)
| > postnet_ssim_loss: 0.64165 (0.64165)
| > loss: 29.30612 (29.30612)
| > align_error: 0.94021 (0.94021)
| > grad_norm: 6.09056 (6.09056)
| > current_lr: 2.5000000000000002e-08
| > step_time: 0.58450 (0.58452)
| > loader_time: 0.34360 (0.34361)

This is a known bug

I've mentioned this on the Coqui Matrix chat; @WeberJulian has said that this is a known bug and that --continue_path is not working as it should. The current fix is to modify train_tts.py to set parse_command_line_args=True when creating the Trainer (line 59).
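
For reference, a rough sketch of that workaround (the surrounding arguments are illustrative; only the parse_command_line_args flag is the actual change):

    # Sketch of the workaround in TTS/bin/train_tts.py (around line 59).
    # Names other than parse_command_line_args are illustrative.
    trainer = Trainer(
        train_args,                     # TrainingArgs built earlier in the script
        config,
        output_path,
        model=model,
        train_samples=train_samples,
        eval_samples=eval_samples,
        parse_command_line_args=True,   # was False; True makes --continue_path take effect
    )
    trainer.fit()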

Environment

thorstenMueller commented 2 years ago

I can confirm this. It was already fixed by @WeberJulian in this commit (https://github.com/coqui-ai/TTS/commit/23d789c0722afe88f0abf3b679ee9199d877eb7a#diff-18ac0d5b5b29ace6dde3a4dc7a18ef3822992377781e2e32c058dc4270d7a1c9) on 20.12.2021, but that change was then overwritten by @Edresson in this commit (https://github.com/coqui-ai/TTS/commit/85418ffeaa93cda22ac0be30855f55a33b64ce13#diff-18ac0d5b5b29ace6dde3a4dc7a18ef3822992377781e2e32c058dc4270d7a1c9), possibly due to a merge conflict.

In my case, setting parse_command_line_args to True in train_tts.py made it work.

r7sa commented 2 years ago

It seems that there is a similar error in training the vocoder: https://github.com/coqui-ai/TTS/blob/c63bb481e95bd4a1ff978947d8e3e6c0bfb4177f/TTS/bin/train_vocoder.py#L64

r7sa commented 2 years ago

One more question. As I understand it, continue_path is for continuing after a crash (for example, a power failure), and restore_path is for starting a new training run from an existing model. When continue_path is used, the model is restored through the restore_path machinery (the restore_model function), and the code at https://github.com/coqui-ai/TTS/blob/main/TTS/trainer.py#L449-L455 then always drops the learning rate back to its starting value. To compensate, https://github.com/coqui-ai/TTS/blob/main/TTS/trainer.py#L312-L317 sets the scheduler's .last_epoch to the last global step. But that is neither correct nor sufficient: not all schedulers compute the LR from last_epoch (torch.optim.lr_scheduler.ExponentialLR, for example, does not). Moreover, if scheduler_after_epoch is used, the LR does not depend directly on the current global step. It seems the right way is to save the scheduler(s) state into the checkpoint (via state_dict) and restore it at L312-L317 (via load_state_dict). Is that correct?
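
In code, what I mean is roughly this (just a sketch with plain PyTorch objects; the checkpoint keys and function names are illustrative, not the actual Trainer code):

    import torch

    def save_checkpoint(path, model, optimizer, scheduler, step):
        # Persist the scheduler state next to the model and optimizer,
        # so that resuming does not fall back to the initial LR schedule.
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "scheduler": scheduler.state_dict(),  # the piece the current checkpoints are missing
                "step": step,
            },
            path,
        )

    def load_checkpoint(path, model, optimizer, scheduler):
        checkpoint = torch.load(path, map_location="cpu")
        model.load_state_dict(checkpoint["model"])
        optimizer.load_state_dict(checkpoint["optimizer"])
        # load_state_dict works for every scheduler, including ExponentialLR,
        # whose per-step LR is derived from the optimizer's current LR rather
        # than recomputed from last_epoch.
        scheduler.load_state_dict(checkpoint["scheduler"])
        return checkpoint["step"]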

iamanigeeit commented 2 years ago

> It seems the right way is to save the scheduler(s) state into the checkpoint (via state_dict) and restore it at L312-L317 (via load_state_dict). Is that correct?

I second this; it is surprising to see the learning rate reset when I resume training. A temporary workaround is to remove / comment out trainer.py L449-455, which resets the LR:

    # trainer.py L449-L455: overwrites the restored learning rate with the initial value from the config
    if isinstance(self.optimizer, list):
        for idx, optim in enumerate(optimizer):
            for group in optim.param_groups:
                group["lr"] = self.get_lr(model, config)[idx]
    else:
        for group in optimizer.param_groups:
            group["lr"] = self.get_lr(model, config)

Then the scheduler newly created in Trainer.__init__() will pick up the resumed LR.

r7sa commented 2 years ago

That was the first thing I tried, but it does not work correctly, because after a restart the scheduler is in its step-0 state.

I think the right way is to store/restore the scheduler state. I have written this code (it is relatively simple) and can open a PR, but I am not sure which branch is current: main, dev, or something else?

Ca-ressemble-a-du-fake commented 2 years ago

@r7sa They wrote somewhere that they don't accept PRs against main, only against dev.

WeberJulian commented 2 years ago

The issue is fixed by https://github.com/coqui-ai/Trainer/commit/61232cdefc5977f0b847b55f64bb13d09e292d11. It should be patched in coqui-ai/TTS once the new Trainer version is released and TTS upgrades to it.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

cergo123 commented 1 year ago

I cannot find that part to comment out, @iamanigeeit! I am having the same issue where current_lr is reset to 0.00000 on every resume.

codepharmer commented 10 months ago

@erogol I am still seeing this issue: the LR resets when continuing training.

I've cloned the latest version of coqui-tts.