SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Training the model from scratch, pronunciation is unintelligible #413

Open yygg678 opened 5 days ago

yygg678 commented 5 days ago

Checks

I have thoroughly reviewed the project documentation and read the related paper(s).

Question details

Using my own phone sequence, I trained the model from scratch with about 200 hours of Chinese data and a 155M-parameter model. The synthesized speech is completely unintelligible. How much data is generally needed to train a model from scratch?

SWivid commented 5 days ago

All details are given in our paper, including the training corpus used for the small model, the batch size, and evaluation results from 400k to 800k updates. Training with the same batch size to approximately 200k updates should yield something intelligible.
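
To make "train to roughly 200k updates" concrete, here is a minimal back-of-envelope sketch relating dataset hours, effective batch size, and updates per epoch. The frame rate and batch size below are illustrative assumptions, not values from the paper; substitute the numbers from your own training config.

```python
# Rough estimate of optimizer updates per epoch for a given amount of audio.
# All concrete numbers are hypothetical placeholders, not paper values.

def updates_per_epoch(dataset_hours: float,
                      batch_frames: int,
                      frames_per_second: float = 93.75) -> float:
    """Gradient updates one pass over the data provides.

    dataset_hours:     total hours of audio in the training set.
    batch_frames:      effective batch size in mel frames
                       (per-GPU frames * num GPUs * grad accumulation).
    frames_per_second: mel frame rate; 93.75 assumes 24 kHz audio with
                       hop length 256 (adjust to your config).
    """
    total_frames = dataset_hours * 3600 * frames_per_second
    return total_frames / batch_frames

if __name__ == "__main__":
    hours = 200                 # dataset size from this issue
    batch_frames = 38_400       # hypothetical effective batch size in frames
    per_epoch = updates_per_epoch(hours, batch_frames)
    target_updates = 200_000    # point where speech reportedly becomes intelligible
    print(f"~{per_epoch:,.0f} updates/epoch -> "
          f"~{target_updates / per_epoch:,.1f} epochs to reach {target_updates:,} updates")
```

With a small dataset, reaching a given update count simply means many more passes over the same audio, so matching the paper's effective batch size and update count matters more than epoch count alone.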