X-LANCE / VoiceFlow-TTS

[ICASSP 2024] This is the official code for "VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching"
https://cantabile-kwok.github.io/VoiceFlow/

about tts results #3

Closed by forwiat 10 months ago

forwiat commented 10 months ago

Hi author, how many epochs of training does it take to get good results, i.e. speech that cannot be distinguished from human voices?

cantabile-kwok commented 10 months ago

For the first round, training the flow-matching-based model, we ran about 400 epochs (which might be more than necessary; I guess 200 or so should be enough). For the second round, the flow rectification process, we again ran 400 epochs. Similarly, a smaller number may also give good results.

I also want to comment on the point about distinguishing the output from human voices. For models of this limited size and with this amount of training data, it is very hard to reach human-level quality and be indistinguishable from human voices. I think neither VoiceFlow nor other models like GradTTS, GlowTTS, etc., can produce speech that is fully natural and free of artifacts. So, to be honest and modest, "good" quality can be expected, but not speech that is indistinguishable from human voices. 😄

forwiat commented 10 months ago

Thanks for the correction and explanation! I will try it again :)