TensorSpeech / TensorFlowTTS

TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Tacotron 2 NaN loss when using pretrained model as starting point #496

Closed PedroDKE closed 3 years ago

PedroDKE commented 3 years ago

I'm trying to fine-tune the pretrained Tacotron 2 model on my own (English) dataset, following the steps explained here. My new dataset uses a processor similar to the one for the Thorsten dataset, but with the English cleaners. To get the vocabulary length of my dataset, I've added my own lines at this point in my local files to point to my custom processor/symbols.
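For illustration, a minimal sketch of what such a custom symbol table might look like (hypothetical names; the actual TensorFlowTTS processor/symbols API may differ):

```python
# A minimal, hypothetical symbol table for an English dataset
# (illustrative only; the real TensorFlowTTS processor defines its own symbols).
_pad = "pad"
_eos = "eos"
_punctuation = "!'(),.:;? "
_letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Every character the cleaners can emit must have an ID in this list.
CUSTOM_SYMBOLS = [_pad] + list(_punctuation) + list(_letters) + [_eos]

# The encoder embedding table is sized by this vocabulary length, which is
# why a pretrained embedding built for a different vocabulary gets skipped
# when the checkpoint is loaded.
print("vocab size:", len(CUSTOM_SYMBOLS))
```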

This seems to be working, as I get the following output before training:

2021-02-13 15:21:58,003 (hdf5_format:779) WARNING: Skipping loading of weights for layer encoder due to mismatch in shape ((63, 512) vs (149, 512)).
2021-02-13 15:21:59,306 (train_tacotron2:466) INFO: Successfully loaded pretrained weight from ../pretrained/tacotron2/model-120000.h5.

But during the first evaluation of the training set I get the following output:

INFO: (Step: 150) train_stop_token_loss = nan.
2021-02-13 15:37:59,026 (base_trainer:1014) INFO: (Step: 150) train_mel_loss_before = nan.
2021-02-13 15:37:59,026 (base_trainer:1014) INFO: (Step: 150) train_mel_loss_after = nan.
2021-02-13 15:37:59,027 (base_trainer:1014) INFO: (Step: 150) train_guided_attention_loss = nan.

I did not add any argument similar to 'var_train_expr: "embeddings|encoder|decoder"' because I would like to train all layers. Any idea what is causing this and how I can fix it? I already disabled mixed precision training, and that did not get rid of the problem.
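One quick way to rule out bad input data before digging into the model is to scan the dumped training features for NaN/Inf values. A rough sketch (the glob patterns below are assumptions about the dump layout, not the exact TensorFlowTTS directory structure):

```python
# Sanity check (not part of TensorFlowTTS): scan dumped feature files for
# NaN/Inf values, which would make every loss go NaN almost immediately.
import glob
import numpy as np

def files_with_bad_values(pattern):
    """Return the paths under `pattern` whose arrays contain NaN or Inf."""
    bad = []
    for path in glob.glob(pattern):
        arr = np.load(path)
        if not np.isfinite(arr).all():
            bad.append(path)
    return bad

# Assumed dump locations; adjust to wherever the preprocessing step wrote files.
for pattern in ["dump/train/norm-feats/*.npy", "dump/valid/norm-feats/*.npy"]:
    bad = files_with_bad_values(pattern)
    print(f"{pattern}: {len(bad)} files containing NaN/Inf")
```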

And as an FYI: training a model from scratch (using the same command, but leaving out the --pretrained args) works as it should.

It is probably the same problem as in #194, but with MFA I don't get accurate durations (some letters are left out when they shouldn't be). My use case is the same as in that issue: I want to use Tacotron 2 to extract the durations for my dataset. Do you think a Tacotron 2 model trained from scratch will do this with reasonable accuracy? And how many iterations should I train it for?

dathudeptrai commented 3 years ago

@PedroDKE Training from scratch is fine; you do not need to fine-tune from our checkpoints. Training for around 60k-80k steps should be enough to get good durations for FastSpeech2.
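For reference, durations are typically read off a trained Tacotron 2 attention alignment roughly like this (a simplified sketch, not the project's actual extraction script): each decoder frame is assigned to its most-attended input token, and a token's duration is the number of frames assigned to it.

```python
import numpy as np

def alignment_to_durations(alignment):
    """alignment: [decoder_frames, encoder_steps] attention weights."""
    num_tokens = alignment.shape[1]
    best_token = np.argmax(alignment, axis=1)           # one token per frame
    durations = np.bincount(best_token, minlength=num_tokens)
    return durations                                     # sums to decoder_frames

# Toy example: 6 decoder frames attending over 3 input tokens.
toy = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
])
print(alignment_to_durations(toy))  # -> [2 2 2]
```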

PedroDKE commented 3 years ago

After training the Tacotron 2 model to extract the durations (70k steps on ~5k audio files), I made a little script to plot the MFCCs and all extracted features (f0, energy and durations). There I saw that in some cases (I'd say roughly 25%) the durations don't look too good (though it's already an improvement over using MFA or a pretrained Tacotron 2 in my case). Should I train the Tacotron 2 model for more steps, or would these kinds of durations be fine for training a FastSpeech2 model? I have attached a screenshot of both train/eval losses and two examples where the durations are off. Any tips?

transcript: i am changed, but you must always be my friend.eos 9-00009-f000107

transcript: or was it indifferent to results ?eos 9-00008-f000331

[attached: train and eval loss curves]
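For a visual sanity check like the one described above, overlaying the token boundaries implied by the extracted durations on the mel spectrogram makes bad alignments easy to spot. A rough sketch (the .npy file layout is an assumption, and this is not a TensorFlowTTS utility):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_duration_check(mel_path, duration_path, out_path="duration_check.png"):
    mel = np.load(mel_path)             # assumed shape: [frames, n_mels]
    durations = np.load(duration_path)  # assumed: one integer per input token
    boundaries = np.cumsum(durations)   # cumulative frame index per token

    plt.figure(figsize=(12, 4))
    plt.imshow(mel.T, origin="lower", aspect="auto")
    # Draw a vertical line at each token boundary; misaligned durations show
    # up as lines that cut through the middle of phones or silences.
    for b in boundaries[:-1]:
        plt.axvline(x=b, color="white", linewidth=0.5)
    plt.xlabel("mel frames")
    plt.ylabel("mel bins")
    plt.savefig(out_path)
    plt.close()
```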

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.