Open · yanglu1994 opened this issue 3 years ago
Hi @yanglu1994,
The model uses the stopping mechanism proposed in the original Tacotron paper, i.e. when a frame of all zeros is predicted, generation is terminated. I haven't experienced any run-on audio, so could you share an example input where you are seeing this?
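As a rough illustration of the stopping mechanism described above, here is a minimal NumPy sketch (not the repo's actual code; the function names, the 80-bin mel shape, and the -0.2 threshold mentioned later in the thread are assumptions): generation halts once a predicted frame is effectively silent, i.e. every mel bin sits above the stop threshold near the zero-padding value.

```python
import numpy as np

def is_end_frame(frame, stop_threshold=-0.2):
    """Treat a frame as end-of-utterance when every mel bin is above
    the stop threshold, i.e. the frame is close to the all-zeros
    padding frame (mel values scaled to roughly [-1, 0])."""
    return bool(np.all(frame > stop_threshold))

def generate(decode_step, max_steps=1000):
    """Run a decoder callback until it emits an end frame or max_steps
    is reached; returns the generated (frames, n_mels) spectrogram."""
    frames = []
    for _ in range(max_steps):
        frame = decode_step()
        if is_end_frame(frame):
            break
        frames.append(frame)
    return np.stack(frames) if frames else np.empty((0,))
```

If the model never learns to emit that near-zero frame cleanly, the loop runs to `max_steps`, which is exactly the run-on behaviour reported in this issue.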
It happens occasionally, so I think the model needs a stop_protect_layer.
Thanks for sharing. Could you also send the text input so that I can try it out on my side?
My data is Chinese.
Oh, I see. Based on the scale of your mel-spectrogram, it looks like you're using different preprocessing steps? The way I set it up, the mel-spectrogram magnitudes range from -1 to just above 0, so at this line and this line, when I add padding for the end of the sentence, I use a value of 0. Did you also change this? If you didn't, that could help explain the issue.
The way I set it up, the mel-spectrogram magnitudes range from -4 to just above 4. I will change my padding_value and retrain. When I generate the mel, the stop_threshold is -0.2. Should I change stop_threshold to -2 or -3?
@yanglu1994, I would advise setting the padding to +4 and the stop_threshold to something like 3.5. The reason to go this way around is that you want the end-of-utterance frame to be different from a typical silent frame. If you set the padding to -4, the model might stop early at a pause in the sentence.
Let me know how it goes.
In my opinion, the padding value should correspond to no energy, so if the mel-spectrogram ranges from -4 to 4, the padding value should be -4. If it ranges from -1 to 0, the padding value should be -1.
No, I disagree. If you do that, how will the model distinguish between the end of the utterance and pauses or silent frames somewhere in the middle of the utterance? For example, in the sentence "Hello, how are you?" the pause at the comma will have low energy, near -4 in your case. So if you set stop_threshold to -3.5, the model will stop generating early. On the other hand, a frame of all +4s doesn't appear as a natural part of speech, so the model can learn to identify it as the end of the utterance.
That's basically what I used for the LJSpeech model and it seems to work really well so I'd recommend giving it a try.
Hi @yanglu1994, just wanted to follow up to check if you got your model to work?
Ending the sentence according to the stop_threshold is sometimes incorrect, so I added a stop-prediction layer to decide the ending position. Now I have solved this problem. Thank you anyway~
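For readers hitting the same issue: the stop-prediction layer mentioned above is presumably in the style of Tacotron 2's stop token, i.e. a learned linear projection of the decoder state to a per-frame stop probability, trained with binary cross-entropy against a target that is 1 on the final frame(s) and 0 elsewhere. A minimal NumPy sketch of the inference side (the class and method names are assumptions; in practice this would be a trained layer inside the model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class StopPredictor:
    """Linear projection + sigmoid over the decoder state, giving a
    per-frame stop probability (Tacotron-2-style stop token)."""
    def __init__(self, decoder_dim, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        # In a real model these weights are learned with BCE loss.
        self.w = rng.normal(scale=0.01, size=(decoder_dim,))
        self.b = 0.0

    def stop_probability(self, decoder_state):
        return sigmoid(decoder_state @ self.w + self.b)

    def should_stop(self, decoder_state, threshold=0.5):
        return bool(self.stop_probability(decoder_state) > threshold)
```

Because the stop decision is predicted from the decoder state rather than read off the mel values, it is robust to pauses and to changes in spectrogram scaling, which is why it fixes the early/late stopping problems discussed in this thread.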
I am wondering if the attention parameters should be changed for different sample rates? I find that 16 kHz data can't synthesize correctly the way 48 kHz data does; the end of the 16 kHz alignment always looks like this: ![Uploading alignment_0.png…]()
I met a similar problem: the synthesized audio is too long and the speech ends suddenly. I would like to ask how many steps it took for you to get tacotron-ljspeech-yspjx3.pt? That pretrained model works well for long utterances, but the one I trained fails.
For instance, the text is "In this work we choose a low-level acoustic representation: mel-frequency, to bridge the two components. Using a representation that is easily computed from time-domain wave forms allows us to train the two components separately. This representation is also smoother than wave form samples and is easier to train using a squared error loss because it is invariant to phase within each frame." My model outputs speech only up to "two components", followed by a long silence.
There is no stop-protection layer in the model, so sometimes the generated audio will be too long.