Closed: @Acrobot closed this issue 4 years ago
@Acrobot 400k to 800k iters sounds a bit excessive even if your dataset is big. In my experiments, the sweet spot for phonetic MFA-aligned LJSpeech is about 60k, and when finetuning on another dataset the model often overfits from about 20k onward, after which pronunciations start to suffer. Have you tried earlier checkpoints? FastSpeech2 is easy to train but also very sensitive to overfitting; at 160k, my LJSpeech model started suffering a lot.
@Acrobot Zero duration is OK; it represents 2 cases:
BTW, I never train the model for more than 200k steps :)). When energy and F0 begin to overfit, the best checkpoint is around that step count. Based on your TensorBoard, for example with the orange model, the best checkpoint is around 100k-150k steps.
@ZDisket, @dathudeptrai, thank you for the insights! I have checked the models at earlier checkpoints as well, and unfortunately they still skip some phonemes. In my case, the sentence I am testing (this time from the validation set) is "which song or artist do you want the station for", and the model skips the "o" in "song" and the "a" in "artist" and "want".
Predictions from the duration network (from the blue model above, taken from 50k iters):
```
array([ 5,  4,  9, 11,  0,  7,  9,  8,  0,  4,  3,  5,  8,  3,  4,  3,  4,
        3,  9,  0,  3,  3,  3,  3, 10,  4, 10, 10,  8,  9,  9, 14,  0,  0,
        0,  0], dtype=int32)
```
actual ground truth:
```
array([ 5,  4, 10,  4,  7,  6,  6,  8,  9,  4,  3,  5,  8,  3,  3,  3,  3,
        3,  7,  5,  3,  3,  3,  3,  9,  4, 10,  9,  8, 10,  9, 15],
      dtype=int32)
```
Phonemes and their (predicted) lengths:
```
[('w', 5), ('ih', 4), ('ch', 9), ('s', 11), ('aa', 0), ('ng', 7), ('ao', 9),
 ('r', 8), ('aa', 0), ('r', 4), ('t', 3), ('ih', 5), ('s', 8), ('t', 3),
 ('d', 4), ('uw', 3), ('j', 4), ('uw', 3), ('w', 9), ('aa', 0), ('n', 3),
 ('t', 3), ('dh', 3), ('ax', 3), ('s', 10), ('d', 4), ('ey', 10), ('sh', 10),
 ('nx', 8), ('f', 9), ('ao', 9), ('r', 14)]
```
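A quick way to spot the pattern is to zip the phoneme sequence with the predicted durations and flag every zero. A minimal sketch using the numbers above (trailing pad values dropped from the prediction):

```python
import numpy as np

# Predicted durations from the blue model at 50k iters (first 32 values)
pred = np.array([5, 4, 9, 11, 0, 7, 9, 8, 0, 4, 3, 5, 8, 3, 4, 3, 4,
                 3, 9, 0, 3, 3, 3, 3, 10, 4, 10, 10, 8, 9, 9, 14], dtype=np.int32)
phonemes = ['w', 'ih', 'ch', 's', 'aa', 'ng', 'ao', 'r', 'aa', 'r', 't', 'ih',
            's', 't', 'd', 'uw', 'j', 'uw', 'w', 'aa', 'n', 't', 'dh', 'ax',
            's', 'd', 'ey', 'sh', 'nx', 'f', 'ao', 'r']

# Collect every phoneme the model assigns zero frames to
dropped = [(i, p) for i, (p, d) in enumerate(zip(phonemes, pred)) if d == 0]
print(dropped)  # -> [(4, 'aa'), (8, 'aa'), (19, 'aa')]
```

Every zero-duration prediction is an 'aa', which points at a symbol-table problem rather than a modelling one.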
(Actually, looking at it now, it seems to have learned that the 'aa' phoneme is always 0-length, and 'aa' is the first phoneme in my symbols list. Is there an assumption that the pad token should be first? If so, that might be the root cause of the problem: my symbols are sorted alphabetically and pad is only at index 32!)
As you can see, the phonemes should definitely have a duration greater than zero. When I generated the mel-spectrogram, feeding the network ground-truth durations instead of using the duration network, I get speech that is not perfect, but close enough and not missing any phonemes.
I have also synthesised the same speech from extracted ground-truth spectrograms using MB-MelGAN and the audio is correct, it is saying the same sentence.
I am also training Tacotron 2 in order to be able to align the dataset using that instead of our own forced alignment, but this is probably going to need 2-3 more days to train (currently at ~20k iters).
@Acrobot padding always at 0-idx.
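Concretely, the pad token has to come before the (alphabetically sorted) phonemes when the symbol table is built. A minimal sketch with an illustrative phoneme inventory (the real list comes from your lexicon):

```python
# Hypothetical phoneme inventory for illustration only
phonemes = sorted(['aa', 'ao', 'ax', 'ch', 'd', 'dh', 'ey', 'f', 'ih', 'j',
                   'n', 'ng', 'nx', 'r', 's', 'sh', 't', 'uw', 'w'])

# Wrong: sorting pad in with the rest pushes 'aa' to index 0, so 'aa' is
# silently treated as padding and masked out of the model
bad_symbols = sorted(phonemes + ['pad'])
print(bad_symbols[0])  # -> 'aa'

# Right: pad must be the very first symbol
symbols = ['pad'] + phonemes
print(symbols[0])      # -> 'pad'
```

With the corrected table, index 0 is reserved for padding and every real phoneme maps to a non-zero id.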
@dathudeptrai Oh, I see. Thank you, in this case I think I will be able to train the network properly! I'm going to leave this open for now, but it's probably the root cause of the issue.
That was the problem, thanks for the help!
Hi @dathudeptrai
I assume the padding at 0-idx comes from https://github.com/TensorSpeech/TensorFlowTTS/blob/555bf4d777211e5a4d03d13ebaca2970250378a9/tensorflow_tts/models/fastspeech.py#L168 ?
What is the pad symbol? Is it the same as a pause or silence?
Thank you
pad positional_embedding is a zero vector.
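The idea can be sketched as a sinusoidal position table whose row 0 is reserved as an all-zero pad vector, so padded positions contribute nothing. This is a simplified illustration, not the repo's exact code:

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Standard sinusoidal table; row 0 is the all-zero pad vector."""
    pos = np.arange(1, max_len).reshape(-1, 1)           # real positions 1..max_len-1
    i = np.arange(d_model).reshape(1, -1)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    table = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return np.vstack([np.zeros((1, d_model)), table])    # prepend zero pad row

table = sinusoidal_position_encoding(max_len=6, d_model=4)
print(table[0])            # pad row: all zeros
print(table[1:].any(1))    # every real position has a non-zero embedding
```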
Hi, first of all thanks for the great repository, it's really helping me in my work!
I have a problem with FastSpeech 2's duration model. Using the standard configuration (the same as for LJSpeech), but with my own dataset (~12 GB of recordings), training on phonemes with an external alignment, the model produces speech that is understandable but sometimes skips a phoneme. This is due to the duration model outputting zeros.
Here is a sample from the duration model (run on the training dataset!):
and here is the ground truth:
I unfortunately cannot share my dataset or samples; however, the same alignment files were used successfully in production to train a different network. I converted them to the same format that the MFA scripts produce, so that I could use `trim_mfa`. I found that by converting the exact durations to frames I lose a bit of precision: 0.12 seconds out of 3.26 seconds of audio. Of course, the last phoneme is padded so that the total number of frames still matches the mel-spectrogram.

Here are some of my test runs. The orange and blue runs use tweaked hyperparameters; the red one uses stock LJSpeech hyperparameters with pre-training on the model provided here in the repo. I have also run a model for 800k iters using the stock LJSpeech hyperparameters without pre-training, but that model still produces zeros.
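The rounding loss can be reproduced by converting per-phoneme durations in seconds to mel frames and absorbing the residual in the last phoneme. A minimal sketch, assuming LJSpeech-style audio parameters (22050 Hz sample rate, 256-sample hop):

```python
import numpy as np

SR, HOP = 22050, 256                 # assumed values, match your config
FRAMES_PER_SEC = SR / HOP

def seconds_to_frames(durs_sec, n_mel_frames):
    """Round each duration to frames, then adjust the last phoneme so the
    total matches the mel-spectrogram length exactly."""
    frames = np.round(np.asarray(durs_sec) * FRAMES_PER_SEC).astype(np.int32)
    frames[-1] += n_mel_frames - frames.sum()   # absorb rounding error at the end
    return frames

durs = [0.12, 0.255, 0.08, 0.4]                 # illustrative durations in seconds
n_frames = int(round(sum(durs) * FRAMES_PER_SEC))
f = seconds_to_frames(durs, n_frames)
print(f, f.sum() == n_frames)
```

Per-phoneme rounding can drift by up to half a frame (~5.8 ms at this hop size) each, which is where the 0.12 s discrepancy over a 3.26 s utterance can come from.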
Does anyone have any ideas on what I might be doing wrong, or why the duration model outputs zeros?