:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
I am a newbie to text-to-speech (TTS) problems. I trained the Tacotron2 model on the KSS dataset (a Korean-language dataset) from scratch. After training, the model produces good speech audio. Its performance on the validation set is:
Stop token loss: 0.0000
Mel spectrogram loss (before Postnet): 0.1331
Mel spectrogram loss (after Postnet): 0.1089
Guided attention loss: 0.0008
There is one problem that I have no idea how to solve. Given a short text like "윤 후보는 앞서 경기 구리시", Tacotron2 takes nearly 13 seconds to generate the mel-spectrogram. The alignment figure returned by the Location Sensitive Attention module in Tacotron2 shows that it aligns all characters well within the first few seconds. If Tacotron2 stopped generating the spectrogram right there, the audio (produced by the vocoder; I use ParallelWaveGAN) would be good. However, it keeps generating the spectrogram for around 10 more seconds, so the audio ends with noise and many nonsense words.
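One post-hoc workaround for a decoder that overruns like this is to truncate the generated mel-spectrogram at the first frame whose predicted stop probability crosses a threshold. This is a minimal sketch in plain NumPy, assuming your inference call exposes the per-frame stop-token logits (the exact return values of `tacotron2.inference` may differ in your TensorFlowTTS version, so check them before wiring this in):

```python
import numpy as np

def truncate_mel(mel, stop_logits, threshold=0.5):
    """Cut the mel-spectrogram at the first frame whose predicted stop
    probability exceeds `threshold`.

    mel:         array of shape (frames, n_mels), decoder output
    stop_logits: array of shape (frames,), raw stop-token logits
    """
    logits = np.asarray(stop_logits, dtype=np.float64)
    probs = 1.0 / (1.0 + np.exp(-logits))      # sigmoid -> stop probability
    over = np.where(probs > threshold)[0]      # frames past the threshold
    end = int(over[0]) + 1 if len(over) else len(mel)  # keep everything if none fire
    return mel[:end]
```

If the stop probabilities never cross the threshold (which your symptoms suggest), you can lower `threshold`, or fall back to cutting at the frame where the attention alignment leaves the final character.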
One trick that mitigates the problem is appending a period to the text (e.g. "윤 후보는 앞서 경기 구리시."), but this trick is impractical and sometimes ineffective.
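If you do fall back on this trick, it can at least be automated in text preprocessing rather than done by hand. A hypothetical helper (the punctuation set is an assumption; extend it for your corpus):

```python
def normalize_text(text: str) -> str:
    """Append a sentence-final period if the input lacks terminal
    punctuation, so the stop token is more likely to fire."""
    text = text.strip()
    # "。" covers full-width punctuation sometimes seen in Korean/CJK text.
    if text and text[-1] not in ".!?。":
        text += "."
    return text
```

This keeps user-facing input unchanged while ensuring every utterance the model sees ends in a stop cue, which is closer to what the KSS training transcripts look like.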
What should I do to solve this problem? Thank you for your advice.