TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

🇰🇷 Korean TTS now available 😘 #183

Closed · dathudeptrai closed this issue 3 years ago

dathudeptrai commented 4 years ago

Korean TTS is now available; thanks to Jaehyoung Kim (@crux153) for his support :D. The model was trained on the KSS dataset (https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset); thanks to @Kyubyong for open-sourcing the Korean dataset :D. The pretrained model is licensed under CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/), the same license as the original KSS dataset.

Please check out the Colab below and enjoy :D.

https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing
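
For anyone who wants the gist without opening the Colab, here is a minimal sketch of the inference flow it demonstrates, assuming the library's `tensorflow_tts.inference` auto classes and the pretrained KSS model IDs later published on Hugging Face (the notebook itself may load checkpoints differently):

```python
# Minimal sketch: Korean FastSpeech2 + MB-MelGAN inference with TensorFlowTTS.
# Model IDs ("tensorspeech/tts-...-kss-ko") are assumed from the project's
# Hugging Face page; the Colab may load local checkpoints instead.
import tensorflow as tf
import soundfile as sf
from tensorflow_tts.inference import AutoProcessor, TFAutoModel

processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-kss-ko")
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-kss-ko")
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-kss-ko")

text = "신은 우리의 수학 문제에는 관심이 없다."
input_ids = processor.text_to_sequence(text)

# FastSpeech2 predicts mels before/after the postnet plus duration, f0, energy.
mel_before, mel_after, duration, f0, energy = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# Decode the mel with the multi-band MelGAN vocoder and save 22.05 kHz audio.
audio = mb_melgan.inference(mel_after)[0, :, 0]
sf.write("kss_fs2.wav", audio.numpy(), 22050)
```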

Joovvhan commented 4 years ago

Hello @dathudeptrai, just as I promised, I made a demo page comparing GlowTTS and FastSpeech2, both trained on the KSS dataset.

I am sorry that it took longer than I expected. I have been suffering from a fever this week (fortunately not COVID-19).

Thanks to your advice, I replaced the WaveGlow vocoder with the MB-MelGAN vocoder from your link.

All the FastSpeech2 samples were also made using the Google Colab page.

I also replaced the original Glow-TTS preprocessing with your method, so I guess the conditions are now pretty even.

On the demo page, I included two types of sentences.

Short sentences are from the original KSS dataset and the longer ones are from other sources.

For the short sentences, the accent, phoneme duration, and pitch of the FastSpeech2 samples are much more similar to the Ground Truth samples.

On the other hand, GlowTTS generates emotionally neutral samples that are less similar to the Ground Truth, and its pitch is a bit unstable.

In addition, the FastSpeech2 samples are louder than the GT or GlowTTS samples, and I guess the loudness causes the artificial sound effects.

I am not sure why the FS2 samples are louder. Do you have any insight to share?

Finally, for the longer, unseen sentences, I segmented each sentence into four or five parts to synthesize with the FS2 model, while GlowTTS did not need such a step.
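
Concretely, the chunking amounts to something like the sketch below; the `synthesize` helper is hypothetical and stands for the FS2 + vocoder call from the Colab, and the real split points were picked by hand around punctuation:

```python
# Rough sketch of chunked synthesis for long inputs. `synthesize` is a
# hypothetical helper wrapping the FS2 + MB-MelGAN calls shown above and
# returning a float32 waveform as a NumPy array.
import re
import numpy as np

def synthesize_long(text, synthesize, pause_sec=0.15, sr=22050):
    # Split after sentence-internal/final punctuation, keeping non-empty chunks.
    chunks = [c.strip() for c in re.split(r"(?<=[.!?,])\s+", text) if c.strip()]
    silence = np.zeros(int(pause_sec * sr), dtype=np.float32)
    pieces = []
    for chunk in chunks:
        pieces.append(synthesize(chunk))
        pieces.append(silence)
    return np.concatenate(pieces[:-1])  # drop the trailing pause
```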

As I mentioned earlier, GlowTTS is much more neutral, while FastSpeech2 is much closer to the samples included in KSS in terms of accent and pitch. As a matter of fact, I wonder if the FS2 model is a bit overfitted.

The original Glow-TTS authors wrote that the pitch of the generated speech can be controlled by the parameter T; below is my experimental result.

(Screenshot of the temperature experiment results, 2020-08-22.)

The authors reported that temperature 0.333 generates the best audio quality, which is in accordance with my observation. Yet 0.333 generates emotionless speech, and 0.667 generates speech of poorer quality.
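
For context, in the official Glow-TTS PyTorch implementation T corresponds, as far as I can tell, to the `noise_scale` argument used when sampling the latent prior at generation time. A minimal sketch, assuming `model`, `x`, and `x_lengths` are prepared as in that repo's inference notebook:

```python
# Sketch of temperature control in the official Glow-TTS PyTorch code
# (jaywalnut310/glow-tts): T is the noise_scale at generation time.
# Assumes `model`, `x`, `x_lengths` are set up as in the repo's notebook.
import torch

with torch.no_grad():
    for T in (0.333, 0.667, 1.0):
        (mel, *_), *_ = model(x, x_lengths, gen=True,
                              noise_scale=T, length_scale=1.0)
        # Lower T -> flatter, more neutral prosody; higher T -> more
        # variation but noisier output, matching the observation above.
```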

In terms of audio quality alone, GlowTTS (0.333) is better; in terms of mimicking the original speaker, FastSpeech2 is better.

I wonder whether this answer meets your expectations.

If you need further discussion or explanation, feel free to ask.

Thanks, your code is amazing!

dathudeptrai commented 4 years ago

@Joovvhan Hi, I am not at home now; I will take a look tonight. BTW, the FS2 problem of being a bit louder than normal can be controlled. In GlowTTS you control it by adjusting the T value; in my FS2 you can adjust it by modifying f0_ratios and energy_ratios in the Colab. I think you can try adjusting f0_ratios from 1.0 to 0.5/0.6/0.7 and energy_ratios from 1.0 to 0.7/0.8/0.9 to see if it is better. I tried adjusting the f0_ratios for KSS and it worked; the generated audio is less loud.
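
Something like this in the Colab's inference cell (a sketch; `fastspeech2` and `input_ids` are as set up earlier in the notebook):

```python
# Sketch of the loudness fix: lower f0_ratios and energy_ratios in the
# FastSpeech2 inference call from the Colab.
import tensorflow as tf

mel_before, mel_after, duration, f0, energy = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([0.6], dtype=tf.float32),      # try 0.5-0.7
    energy_ratios=tf.convert_to_tensor([0.8], dtype=tf.float32),  # try 0.7-0.9
)
```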

Joovvhan commented 4 years ago

Thanks, I will try that and update my demo page.

BTW, I applied G2PK, a Korean grapheme-to-phoneme conversion package, and it improved the quality of GlowTTS.

(Audio samples: GlowTTS without G2PK vs. GlowTTS with G2PK.)

Did your FS2 also use such a module to convert graphemes to phonemes?

I wonder if FS2 could also be improved using the G2PK module.
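
For reference, using G2PK is minimal; a sketch, assuming the same `processor` text frontend as in the Colab:

```python
# Minimal sketch of G2PK (https://github.com/Kyubyong/g2pK): convert Korean
# graphemes to their pronounced forms before the TTS text frontend.
from g2pk import G2p

g2p = G2p()
text = "어제는 날씨가 맑았는데, 오늘은 흐리다."
phonemized = g2p(text)  # pronunciation-level respelling of the input
input_ids = processor.text_to_sequence(phonemized)  # same frontend as before
```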

dathudeptrai commented 4 years ago

Thank you for your information. I can't understand Korean, so I can't help more :D. The Korean model here is supported by @crux153 :))). I would appreciate it if someone could make a pull request to improve the base model :D

Joovvhan commented 4 years ago

Thanks for the quick reply.

I will look through your source code and ask @crux153 myself.

Thanks!

crux153 commented 4 years ago

@Joovvhan It won't take too long to apply G2PK or another Korean G2P module to this package. Actually, I also thought about applying it, but I'm using the trained model in C++, and that would require porting the G2P module itself to C++, which might take a while.

I'm satisfied with the current quality of the FS2 model, except that it fails to synthesize long samples, which I'm trying to fix in #208. Maybe training with phonemes would help fix this.

https://github.com/HGU-DLLAB/Korean-FastSpeech2-Pytorch

I found another FS2 implementation for Korean; take a look if you haven't. It is trained on KSS using phonemes, and it uses MFA to extract durations (instead of Tacotron2). The audio samples are glitchy, but I think the WaveGlow vocoder is responsible, not FS2 itself. Downloading the trained model and feeding our MB-MelGAN model with mels from it would verify this, but I didn't have much time to do so :(
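
A sketch of that check, assuming the mel can be exported from the PyTorch repo as a `.npy` file (the file name, mel shape, and matching mel normalization are all assumptions):

```python
# Hypothetical vocoder check: decode mels exported from the PyTorch
# Korean-FastSpeech2 repo with our TF MB-MelGAN instead of WaveGlow.
import numpy as np
import soundfile as sf
from tensorflow_tts.inference import TFAutoModel

mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-kss-ko")

mel = np.load("pytorch_fs2_mel.npy").astype(np.float32)  # assumed [frames, 80]
audio = mb_melgan.inference(mel[np.newaxis, ...])[0, :, 0]
sf.write("vocoder_check.wav", audio.numpy(), 22050)
# If this sounds clean, the glitches come from WaveGlow, not from FS2 itself.
```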

Please share your progress if you've got any :D

ZDisket commented 4 years ago

@crux153

> I'm using the trained model in C++, and that would require porting the G2P module itself to C++, which might take a while.

Perhaps you should consider Phonetisaurus, which the C++ example in this repo already uses for G2P. It's even compatible with the pretrained G2P models on the Montreal Forced Aligner page (I just did this for Spanish).

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.