bshall / Tacotron

A PyTorch implementation of Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis
https://bshall.github.io/Tacotron/
MIT License

about stop token #2

Open yanglu1994 opened 3 years ago

yanglu1994 commented 3 years ago

There is no stop-token prediction layer in the model, so sometimes the generated audio is too long.

bshall commented 3 years ago

Hi @yanglu1994,

The model uses the stopping mechanism proposed in the original Tacotron paper, i.e. when a frame of all zeros is predicted, generation is terminated. I haven't experienced any run-on audio, so could you share an example input where you're seeing this?
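For reference, a minimal sketch of that stopping check at inference time (not the repository's exact code; `decoder_step`, `go_frame`, and `stop_threshold` are illustrative names for the decoder step function, the initial frame, and the cutoff below which a frame counts as "all zeros"):

```python
import torch

def generate(decoder_step, go_frame, stop_threshold=-0.2, max_steps=1000):
    """Autoregressively generate mel frames until a near-zero frame appears."""
    frames = []
    frame = go_frame                       # initial all-padding "go" frame
    for _ in range(max_steps):
        frame = decoder_step(frame)        # predict the next mel frame
        frames.append(frame)
        # the end-of-sentence padding value is 0, so a frame whose bins are
        # all above the threshold is treated as the end of the utterance
        if torch.all(frame > stop_threshold):
            break
    return torch.stack(frames, dim=1)      # (n_mels, T), illustrative layout
```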

yanglu1994 commented 3 years ago

It happens occasionally, so I think the model needs a stop prediction layer.

[alignment_9 and spectrogram_9 images attached]

bshall commented 3 years ago

Thanks for sharing. Could you also send the text input so that I can try it out on my side?

yanglu1994 commented 3 years ago

My data is Chinese

bshall commented 3 years ago

Oh, I see. Based on the scale of your mel-spectrogram, it looks like you're using different preprocessing steps. The way I set it up, the mel-spectrogram magnitudes range from -1 to just above 0, so at this line and this line, where I add padding for the end of sentence, I use a value of 0. Did you also change this? If you didn't, that could help explain the issue.

yanglu1994 commented 3 years ago

The way I set it up, the mel-spectrogram magnitudes range from -4 to just above 4. I will change my padding_value and retrain. When I generate the mel, the stop_threshold is -0.2. Should I change stop_threshold to -2 or -3?

bshall commented 3 years ago

@yanglu1994, I would advise setting the padding to +4 and the stop_threshold to something like 3.5. The reason to go this way around is that you want the end-of-utterance frame to be different from a typical silent frame. If you set the padding to -4, the model might stop early at a pause in the sentence.
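To make that concrete, here is a hedged sketch under the assumption that the mel values span roughly [-4, 4]: pad the training targets with end-of-utterance frames of +4, and stop generation once a predicted frame exceeds 3.5 in every bin. The names (`EOS_VALUE`, `pad_with_eos`, `should_stop`) are illustrative, not from the repository:

```python
import numpy as np

EOS_VALUE = 4.0        # padding value, clearly separated from silence (~ -4)
STOP_THRESHOLD = 3.5   # stop once every mel bin exceeds this

def pad_with_eos(mel, eos_frames=1):
    """mel: (T, n_mels) log-mel spectrogram scaled to roughly [-4, 4]."""
    pad = np.full((eos_frames, mel.shape[1]), EOS_VALUE, dtype=mel.dtype)
    return np.concatenate([mel, pad], axis=0)

def should_stop(frame):
    """frame: (n_mels,) predicted frame at inference time."""
    return bool(np.all(frame > STOP_THRESHOLD))
```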

Let me know how it goes.

yanglu1994 commented 3 years ago

In my opinion, the padding value should represent no energy, so if the mel-spectrogram ranges from -4 to 4, the padding value should be -4. If the mel-spectrogram ranges from -1 to 0, the padding value should be -1.

bshall commented 3 years ago

No, I disagree. If you do that, how will the model distinguish between the end of the utterance and pauses or silent frames somewhere in the middle of the utterance? For example, if you have the sentence "Hello, how are you?", the pause at the comma will have low energy (near -4 in your case). So if you set stop_threshold to -3.5, the model will stop generating early. On the other hand, a frame of all 4s doesn't appear as a natural part of speech, so the model can learn to identify the end of the utterance.

That's basically what I used for the LJSpeech model and it seems to work really well so I'd recommend giving it a try.

bshall commented 3 years ago

Hi @yanglu1994, just wanted to follow up to check if you got your model to work?

yanglu1994 commented 3 years ago

Ending the sentence according to the stop_threshold is sometimes incorrect, so I added a stop prediction layer to decide the ending position instead. I have now solved this problem. Thank you anyway~
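For anyone else reading, a minimal sketch of the kind of explicit stop-token head described here (as used in Tacotron 2): a linear projection of the decoder state trained with binary cross-entropy, with the target set to 1 on the final frame of each utterance. Class and variable names are illustrative, not from this repository:

```python
import torch
import torch.nn as nn

class StopTokenPredictor(nn.Module):
    """Predicts a per-frame stop logit from the decoder hidden state."""
    def __init__(self, decoder_dim):
        super().__init__()
        self.linear = nn.Linear(decoder_dim, 1)

    def forward(self, decoder_state):
        # decoder_state: (batch, decoder_dim) -> (batch,) stop logit
        return self.linear(decoder_state).squeeze(-1)

# Training: binary cross-entropy against a target that is 1 at the last
# frame of each utterance and 0 everywhere else.
stop_loss = nn.BCEWithLogitsLoss()

# Inference: stop once the predicted probability passes a chosen threshold.
def should_stop(stop_logit, threshold=0.5):
    return torch.sigmoid(stop_logit).item() > threshold
```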

yanglu1994 commented 3 years ago

I am wondering if the attention parameters should be changed for different sample rates? I find that 16 kHz data can't be synthesized correctly the way 48 kHz data can; the end of the 16 kHz alignment always looks like the image below.

yanglu1994 commented 3 years ago

[alignment_0 image attached]

121898 commented 3 years ago

I ran into a similar problem: the synthesized audio is too long and the speech ends suddenly. May I ask how many steps it took to train tacotron-ljspeech-yspjx3.pt? That pretrained model works well for long utterances, but the one I trained fails.

For instance, for the text "In this work we choose a low-level acoustic representation: mel-frequency, to bridge the two components. Using a representation that is easily computed from time-domain wave forms allows us to train the two components separately. This representation is also smoother than wave form samples and is easier to train using a squared error loss because it is invariant to phase within each frame." my model outputs speech only up to "two components" and then follows with a long silence.