TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supported languages include English, French, Korean, Chinese, and German; easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0
3.76k stars · 803 forks

Tacotron2 model generates long Mel-spectrogram for a short input text #756

Closed hoangtrong2305 closed 2 years ago

hoangtrong2305 commented 2 years ago

I am a newbie to text-to-speech (TTS). I trained the Tacotron2 model from scratch on the KSS dataset (a Korean-language dataset). After training, the model produces good-quality speech audio, with reasonable performance on the validation set.

There is one problem that I have no idea how to solve. Given a short text like "윤 후보는 앞서 경기 구리시", Tacotron2 takes nearly 13 seconds to generate the Mel-spectrogram. The alignment figure returned by the Location Sensitive Attention module in Tacotron2 shows that all characters are aligned well within the first few seconds. If Tacotron2 stopped generating the spectrogram right there, the audio (produced by the vocoder; I use ParallelWaveGAN) would be good. However, it keeps generating the spectrogram for around 10 more seconds, so the resulting audio ends with noise and many nonsense words.

[screenshot: attention alignment figure]
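Since the alignment is good for the first few seconds, one post-hoc workaround is to truncate the generated mel at the first decoder frame whose stop-token probability crosses a threshold, instead of relying on the model to stop on its own. A minimal NumPy sketch (the function name, array shapes, and threshold are assumptions for illustration, not the TensorFlowTTS API):

```python
import numpy as np

def trim_mel_by_stop_token(mel, stop_probs, threshold=0.5):
    """Cut the mel-spectrogram at the first frame whose stop-token
    probability exceeds the threshold.

    mel        : (frames, n_mels) generated mel-spectrogram
    stop_probs : (frames,) per-frame stop-token probabilities
    """
    over = np.where(stop_probs > threshold)[0]
    stop_frame = over[0] + 1 if len(over) else len(mel)
    return mel[:stop_frame]

# Toy example: 6 frames, stop token fires at frame 3.
mel = np.zeros((6, 80))
stop_probs = np.array([0.01, 0.02, 0.05, 0.9, 0.95, 0.99])
trimmed = trim_mel_by_stop_token(mel, stop_probs)
print(trimmed.shape)  # → (4, 80)
```

If the stop-token prediction never fires for short inputs, the threshold can be lowered, though at the risk of cutting off longer utterances early.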

One trick that mitigates this problem is appending a period to the input text (e.g. "윤 후보는 앞서 경기 구리시."), but this trick is impractical and sometimes ineffective.

What should I do to solve this problem? Thank you for your advice.
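Another possible workaround, since the alignment itself looks clean, is to stop once the attention has stayed on the last encoder step for a while, which usually means the text has been fully read. A rough sketch under assumed shapes (the `patience` heuristic and all names here are illustrative, not part of TensorFlowTTS):

```python
import numpy as np

def trim_mel_by_alignment(mel, alignment, patience=20):
    """Cut the mel once attention has dwelt on the final input
    character for `patience` consecutive decoder frames.

    mel       : (decoder_frames, n_mels) generated mel-spectrogram
    alignment : (decoder_frames, encoder_steps) attention weights
    """
    focus = alignment.argmax(axis=1)      # attended input position per frame
    last_step = alignment.shape[1] - 1
    stuck = 0
    for t, pos in enumerate(focus):
        stuck = stuck + 1 if pos >= last_step else 0
        if stuck >= patience:
            return mel[: t + 1]
    return mel                            # attention never settled; keep all

# Toy example: attention advances one input position per frame,
# then stays on the last position (index 4) from frame 4 onward.
alignment = np.zeros((30, 5))
for t in range(30):
    alignment[t, min(t, 4)] = 1.0
mel = np.zeros((30, 80))
trimmed = trim_mel_by_alignment(mel, alignment, patience=5)
print(trimmed.shape)  # → (9, 80)
```

This keeps the well-aligned prefix of the spectrogram and discards the trailing frames where the decoder has nothing left to attend to.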

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.