forwiat closed this issue 10 months ago
For the first round of training the flow-matching-based model, we ran about 400 epochs (which is probably more than necessary; I guess around 200 would be enough). For the second round, the flow rectification process, we again ran 400 epochs. Similarly, a smaller number may also give good results.
Note that I also want to comment on "distinguish human voices". For models with this limited size and amount of training data, it is very hard to reach human-level quality and be indistinguishable from human voices. I think neither VoiceFlow nor the other models, such as GradTTS and GlowTTS, can produce speech that is fully natural and free of artifacts. So, to be honest and modest, "good" quality can be expected, but not speech that is indistinguishable from human voices. 😄
Thanks for the correction and explanation! I will try it again :)
Hi author, I wonder how many epochs are needed to get good results, which can distinguish human voices?