Closed: @Acrobot closed this issue 4 years ago
@Acrobot 400k to 800k iters sounds a bit excessive even if your dataset is big. In my experiments, the sweet spot for phonetic MFA-aligned LJSpeech is about 60k, and when finetuning on another dataset the model often overfits from about 20k onward, after which pronunciations start to suffer. Have you tried earlier checkpoints? FastSpeech2 is easy to train but also very sensitive to overfitting; at 160k, my LJSpeech model started suffering a lot.
@Acrobot Zero duration is OK; it represents 2 cases:
BTW, I never train the model for more than 200k steps :)). When energy and F0 begin to overfit, the best checkpoint is around that step count. Based on your TensorBoard, for example with the orange model, the best checkpoint is around 100k-150k steps.
@ZDisket, @dathudeptrai, thank you for the insights! I have checked the models at earlier checkpoints as well, and unfortunately they still skip some phonemes. In my case, the sentence I am testing (this time from the validation set) is "which song or artist do you want the station for", and the model skips the "o" in "song" and the "a" in "artist" and "want".
Predictions from the duration network (from the blue model above, taken from 50k iters):
```
array([ 5,  4,  9, 11,  0,  7,  9,  8,  0,  4,  3,  5,  8,  3,  4,  3,  4,
        3,  9,  0,  3,  3,  3,  3, 10,  4, 10, 10,  8,  9,  9, 14,  0,  0,
        0,  0], dtype=int32)
```
actual ground truth:
```
array([ 5,  4, 10,  4,  7,  6,  6,  8,  9,  4,  3,  5,  8,  3,  3,  3,  3,
        3,  7,  5,  3,  3,  3,  3,  9,  4, 10,  9,  8, 10,  9, 15],
      dtype=int32)
```
Phonemes and their (predicted) lengths:
```
[('w', 5), ('ih', 4), ('ch', 9), ('s', 11), ('aa', 0), ('ng', 7), ('ao', 9),
 ('r', 8), ('aa', 0), ('r', 4), ('t', 3), ('ih', 5), ('s', 8), ('t', 3),
 ('d', 4), ('uw', 3), ('j', 4), ('uw', 3), ('w', 9), ('aa', 0), ('n', 3),
 ('t', 3), ('dh', 3), ('ax', 3), ('s', 10), ('d', 4), ('ey', 10), ('sh', 10),
 ('nx', 8), ('f', 9), ('ao', 9), ('r', 14)]
```
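A quick way to spot the pattern is to zip the phoneme sequence with the predicted durations and flag every zero. A minimal sketch using the numbers above (trailing pad values dropped from the prediction):

```python
import numpy as np

# Predicted durations from the blue model at 50k iters (first 32 values)
pred = np.array([5, 4, 9, 11, 0, 7, 9, 8, 0, 4, 3, 5, 8, 3, 4, 3, 4,
                 3, 9, 0, 3, 3, 3, 3, 10, 4, 10, 10, 8, 9, 9, 14], dtype=np.int32)
phonemes = ['w', 'ih', 'ch', 's', 'aa', 'ng', 'ao', 'r', 'aa', 'r', 't', 'ih',
            's', 't', 'd', 'uw', 'j', 'uw', 'w', 'aa', 'n', 't', 'dh', 'ax',
            's', 'd', 'ey', 'sh', 'nx', 'f', 'ao', 'r']

# Collect every phoneme the model assigns zero frames to
dropped = [(i, p) for i, (p, d) in enumerate(zip(phonemes, pred)) if d == 0]
print(dropped)  # -> [(4, 'aa'), (8, 'aa'), (19, 'aa')]
```

Every zero-duration prediction is an 'aa', which points at a symbol-table problem rather than a modelling one.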
(Actually, looking at it now, it seems to have learned that the 'aa' phoneme is always 0-length, and 'aa' is the first phoneme in my symbols list. Is there an assumption that the pad token should be first? If so, that might be the root cause of the problem: my symbols are sorted alphabetically and pad is only at index 32!)
As you can see, the phonemes should definitely have a duration greater than zero. When I generated the mel-spectrogram, feeding the network ground-truth durations instead of using the duration network, I get speech that is not perfect, but close enough and not missing any phonemes.
I have also synthesised the same speech from extracted ground-truth spectrograms using MB-MelGAN and the audio is correct, it is saying the same sentence.
I am also training Tacotron 2 in order to be able to align the dataset using that instead of our own forced alignment, but this is probably going to need 2-3 more days to train (currently at ~20k iters).
@Acrobot padding always at 0-idx.
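Concretely, the pad token has to come before the (alphabetically sorted) phonemes when the symbol table is built. A minimal sketch with an illustrative phoneme inventory (the real list comes from your lexicon):

```python
# Hypothetical phoneme inventory for illustration only
phonemes = sorted(['aa', 'ao', 'ax', 'ch', 'd', 'dh', 'ey', 'f', 'ih', 'j',
                   'n', 'ng', 'nx', 'r', 's', 'sh', 't', 'uw', 'w'])

# Wrong: sorting pad in with the rest pushes 'aa' to index 0, so 'aa' is
# silently treated as padding and masked out of the model
bad_symbols = sorted(phonemes + ['pad'])
print(bad_symbols[0])  # -> 'aa'

# Right: pad must be the very first symbol
symbols = ['pad'] + phonemes
print(symbols[0])      # -> 'pad'
```

With the corrected table, index 0 is reserved for padding and every real phoneme maps to a non-zero id.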
@dathudeptrai Oh, I see. Thank you, in this case I think I will be able to train the network properly! I'm going to leave this open for now, but it's probably the root cause of the issue.
That was the problem, thanks for the help!
Hi @dathudeptrai
I assume the padding at 0-idx comes from https://github.com/TensorSpeech/TensorFlowTTS/blob/555bf4d777211e5a4d03d13ebaca2970250378a9/tensorflow_tts/models/fastspeech.py#L168 ?
What is the pad symbol? Is it the same as a pause or silence?
Thank you
pad positional_embedding is a zero vector.
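The idea can be sketched as a sinusoidal position table whose row 0 is reserved as an all-zero pad vector, so padded positions contribute nothing. This is a simplified illustration, not the repo's exact code:

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Standard sinusoidal table; row 0 is the all-zero pad vector."""
    pos = np.arange(1, max_len).reshape(-1, 1)           # real positions 1..max_len-1
    i = np.arange(d_model).reshape(1, -1)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    table = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return np.vstack([np.zeros((1, d_model)), table])    # prepend zero pad row

table = sinusoidal_position_encoding(max_len=6, d_model=4)
print(table[0])            # pad row: all zeros
print(table[1:].any(1))    # every real position has a non-zero embedding
```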
Hi, first of all thanks for the great repository, it's really helping me in my work!
I have a problem with FastSpeech 2's duration model. Using the standard configuration (the same as for LJSpeech), but with my own dataset (~12 GB of recordings), training on phonemes with an external alignment, the model produces speech that is understandable but sometimes skips a phoneme. This is due to the duration model outputting zeros.
Here is a sample from the duration model (run on the training dataset!):
and here is the ground truth:
I unfortunately cannot share my dataset or samples; however, the same alignment files were used successfully in production to train a different network. I converted them to the same format that the MFA scripts produce, so that I could use `trim_mfa`. I found that by converting the exact durations to frames I lose a bit of precision: 0.12 seconds out of 3.26 seconds of audio. Of course, the last phoneme is padded so that the total number of frames still matches the mel-spectrogram.

Here are some of my test runs. The orange and blue runs use tweaked hyperparameters; the red one uses stock LJSpeech hyperparameters with pre-training on the model provided here in the repo. I have also run a model for 800k iters using the stock LJSpeech hyperparameters without pre-training, but that model still produces zeros.
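The rounding loss can be reproduced by converting per-phoneme durations in seconds to mel frames and absorbing the residual in the last phoneme. A minimal sketch, assuming LJSpeech-style audio parameters (22050 Hz sample rate, 256-sample hop):

```python
import numpy as np

SR, HOP = 22050, 256                 # assumed values, match your config
FRAMES_PER_SEC = SR / HOP

def seconds_to_frames(durs_sec, n_mel_frames):
    """Round each duration to frames, then adjust the last phoneme so the
    total matches the mel-spectrogram length exactly."""
    frames = np.round(np.asarray(durs_sec) * FRAMES_PER_SEC).astype(np.int32)
    frames[-1] += n_mel_frames - frames.sum()   # absorb rounding error at the end
    return frames

durs = [0.12, 0.255, 0.08, 0.4]                 # illustrative durations in seconds
n_frames = int(round(sum(durs) * FRAMES_PER_SEC))
f = seconds_to_frames(durs, n_frames)
print(f, f.sum() == n_frames)
```

Per-phoneme rounding can drift by up to half a frame (~5.8 ms at this hop size) each, which is where the 0.12 s discrepancy over a 3.26 s utterance can come from.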
Does anyone have any ideas on what I might be doing wrong, or why the duration model outputs zeros?