SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Training the model from scratch, pronunciation is unintelligible #413

Closed yygg678 closed 1 week ago

yygg678 commented 2 weeks ago

Checks

I have thoroughly reviewed the project documentation and read the related paper(s).

Question details

I trained a 155M-parameter model from scratch on about 200 hours of Chinese data, using my own phone sequences as input. The synthesized speech is completely unintelligible. How much data is generally needed to train a model from scratch?

SWivid commented 2 weeks ago

All details are given in our paper, including the training corpus used for the small model, the batch size, and evaluation results from 400K to 800K updates. If you train with the same batch size for approximately 200K updates, you should start to hear something intelligible.
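As a rough aid for relating an update count to a corpus of a given size, the sketch below converts updates into the number of passes over the data. The batch size of 307,200 audio frames and the frame rate (24 kHz audio with a hop length of 256) are assumptions for illustration and should be checked against the paper and your own training config; the function names are hypothetical and not part of the F5-TTS code base.

```python
def batch_hours(batch_frames: int, frames_per_second: float) -> float:
    """Hours of audio contained in one batch of `batch_frames` mel frames."""
    return batch_frames / frames_per_second / 3600.0


def passes_over_data(updates: int, batch_frames: int,
                     frames_per_second: float, dataset_hours: float) -> float:
    """Approximate number of epochs seen after `updates` optimizer steps."""
    return updates * batch_hours(batch_frames, frames_per_second) / dataset_hours


# Assumed values, not taken from this thread:
FRAMES_PER_SECOND = 24_000 / 256   # assumed sample rate / hop length
BATCH_FRAMES = 307_200             # assumed frames per batch

if __name__ == "__main__":
    hours = batch_hours(BATCH_FRAMES, FRAMES_PER_SECOND)
    epochs = passes_over_data(200_000, BATCH_FRAMES, FRAMES_PER_SECOND, 200.0)
    print(f"one batch = {hours:.2f} h of audio")
    print(f"200K updates on 200 h = {epochs:.0f} passes")
```

Under these assumed values, one batch holds about 0.91 hours of audio, so 200K updates on a 200-hour corpus amount to roughly 910 passes over the data. A run that used a much smaller batch would cover far less audio in the same number of updates.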

SWivid commented 1 week ago

I will close this issue. Feel free to reopen it if you have further questions.