TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, French, Korean, Chinese, German and Easy to adapt for other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

[FastSpeech2] question about f0 and energy estimation #199

Closed: Jackson-Kang closed this issue 4 years ago

Jackson-Kang commented 4 years ago

Hello, @dathudeptrai

First of all, thank you for your awesome work. I've learned a lot from your repos.

Looking through your repos, I have a question about f0 and energy estimation.

[Question] Is there a reason you used the approach that FastPitch suggests? I think that learning pitch/energy embeddings the way FastSpeech2 does is problematic, because it is sensitive to outliers and there are too few samples in certain f0 or energy ranges. If I am correct, the model then produces metallic and noisy samples. Am I right? If I am wrong, could you tell me why you chose the FastPitch approach?

Again, thank you for providing this awesome work and maintaining this repo for TTS researchers. Sincerely,

dathudeptrai commented 4 years ago

@Jackson-Kang

In the preprocessing step, I use pw.stonemask and wrote a remove_outlier function, so outliers are not a problem anymore :)). There are 2 reasons why I use continuous values for FastSpeech2 rather than discrete (quantized) values as the FastSpeech2 paper describes:

  1. I found it very hard to tune number_of_bins and the min/max values of F0/energy. I tried softmax classification for f0/energy and the accuracy was only around 60%. Because of this, I don't think the predicted f0/energy can be good enough for the model. The results I got following the paper's information are OK, but somehow the generated audio sounds like it has no emotion, no stress :)) (I don't know the exact word to describe it). It's as if the predicted f0/energy doesn't fluctuate much over time :))), you can imagine that the f0/energy is constant through time :D.

  2. I want to try some new ideas, such as using predicted f0/energy rather than ground-truth f0/energy for training.
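The trade-off in reason 1 can be illustrated with a small NumPy sketch. It is not the TensorFlowTTS code: the function names, the bin range, and the per-frame linear projection are assumptions made for illustration (FastPitch-style models typically apply a 1-D convolution over the contour instead of a per-frame projection). It contrasts a quantized lookup, where nearby F0 values collapse into one bin, with a continuous projection of the F0 value itself.

```python
import numpy as np

rng = np.random.default_rng(0)


def bucket_embedding(f0, f0_min, f0_max, n_bins, table):
    """Quantized variant: map each frame's F0 to a bin, then look up a vector."""
    # n_bins - 1 interior edges split [f0_min, f0_max] into n_bins buckets.
    edges = np.linspace(f0_min, f0_max, n_bins + 1)[1:-1]
    ids = np.digitize(f0, edges)  # integer bin ids in 0 .. n_bins - 1
    return table[ids], ids


def continuous_embedding(f0, weight, bias):
    """Continuous variant: project the scalar F0 of each frame to a vector."""
    return f0[:, None] * weight[None, :] + bias[None, :]


# Toy contour in Hz (hypothetical values, not taken from any dataset).
f0 = np.array([100.0, 101.0, 180.0, 400.0])

table = rng.normal(size=(4, 8))  # one 8-dim vector per bin
quantized, ids = bucket_embedding(f0, 80.0, 400.0, 4, table)

weight, bias = rng.normal(size=8), np.zeros(8)
continuous = continuous_embedding(f0, weight, bias)

# 100 Hz and 101 Hz fall into the same bin, so their quantized
# embeddings are identical, while the continuous ones still differ.
print(ids)
print(np.allclose(quantized[0], quantized[1]))
print(np.allclose(continuous[0], continuous[1]))
```

With the quantized variant, both the number of bins and the min/max range have to be chosen by hand, which is the tuning difficulty described above; the continuous variant has no such hyperparameters but does depend on how the contour is normalized.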

BTW, we have a Korean sample trained on the KSS dataset here (https://colab.research.google.com/drive/1ybWwOS5tipgPFttNulp77P6DAB5MtiuN?usp=sharing). I think the quality is good. I didn't try training on this data with an implementation based on the FastSpeech2 paper :D.

One thing I want to share: when I implemented the FastSpeech2 model here, I didn't know about FastPitch. After I learned about FastPitch and read it, I modified my code based on the FastPitch repo :D (it's almost the same as my idea).
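For readers wondering what an outlier filter in the preprocessing step might look like, here is a minimal sketch. The thread does not show the body of remove_outlier, so the interquartile-range rule, the function name `remove_outlier_iqr`, and the choice to mark outliers as unvoiced (0) are all assumptions. The input is taken to be an already-extracted F0 contour (for example from pyworld's `dio` refined by `stonemask`), with 0 marking unvoiced frames.

```python
import numpy as np


def remove_outlier_iqr(f0, k=1.5):
    """Zero out voiced frames whose F0 lies outside the IQR fences.

    Hypothetical helper: the repo's own remove_outlier may use a
    different rule or a different replacement value.
    """
    f0 = np.asarray(f0, dtype=np.float64).copy()
    voiced = f0 > 0  # assumption: unvoiced frames are stored as 0
    if not voiced.any():
        return f0

    # Fences are computed from voiced frames only, so the many zeros
    # of unvoiced frames do not distort the quartiles.
    q1, q3 = np.percentile(f0[voiced], [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    outliers = voiced & ((f0 < lower) | (f0 > upper))
    f0[outliers] = 0.0  # treat outliers as unvoiced
    return f0


# Toy contour in Hz with one octave-jump style error (600 Hz).
contour = np.array([0.0, 120.0, 125.0, 130.0, 128.0, 122.0, 600.0, 0.0])
cleaned = remove_outlier_iqr(contour)
print(cleaned)
```

Setting outliers to 0 is only one option; replacing them with a neighbouring or mean value would keep the frame voiced and may suit some datasets better.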

Jackson-Kang commented 4 years ago

Thank you for the information.

I tried using predicted f0/energy values and it also generates speech successfully. But I think there's almost no drastic difference between the ground truth (g.t.) and the synthesis.

My experimental code is here: code

Sincerely,

dathudeptrai commented 4 years ago

@Jackson-Kang https://github.com/TensorSpeech/TensorFlowTTS/blob/master/tensorflow_tts/models/fastspeech2.py#L250-L255 :))) I just added this and wanted to let you know :)).

Jackson-Kang commented 4 years ago

@dathudeptrai

Thank you for your kind advice. :) I will refer to it.

Sincerely,