SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

duration prediction #8

Open eschmidbauer opened 8 hours ago

eschmidbauer commented 8 hours ago

Thank you for sharing this project and the models. I have been testing inference and it is very impressive. One thing I have noticed is how sensitive the fix_duration value is: if it is off by 1 s, the utterance may miss a word. Is there a way to predict the duration so that it is accurate for each utterance? Thanks again for sharing this work!

SWivid commented 8 hours ago

We currently estimate the duration linearly (this works as long as the prompt is not too short or spoken at very high speed), or we just set a fixed duration. A separately trained duration predictor model could be used to provide the duration info, though I thought that precisely controlling the duration of each utterance is more the way Voicebox does it. But for sure, we could have an utterance-level duration predictor and just pass the summed duration to the TTS model.
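
For readers who want to see what a linear estimate could look like, here is a minimal sketch. It is not the repository's code: the function name `estimate_total_duration`, the `speed` parameter, and the use of character counts as the text-length unit are all assumptions made for illustration. It also assumes the duration handed to the model covers the prompt plus the generated speech.

```python
def estimate_total_duration(
    ref_seconds: float,
    ref_text: str,
    gen_text: str,
    speed: float = 1.0,
) -> float:
    """Linearly extrapolate total duration (prompt + new speech) in seconds.

    Hypothetical helper: assumes speaking rate is constant and can be
    measured as seconds per character of the reference prompt.
    """
    if not ref_text:
        raise ValueError("reference text must not be empty")
    if speed <= 0:
        raise ValueError("speed must be positive")

    # Speaking rate observed in the prompt: seconds per character.
    seconds_per_char = ref_seconds / len(ref_text)

    # Apply the same rate to the text to be generated; speed > 1 shortens it.
    gen_seconds = seconds_per_char * len(gen_text) / speed

    return ref_seconds + gen_seconds


# Example: a 4 s prompt with 40 characters, generating 60 characters of text.
total = estimate_total_duration(4.0, "a" * 40, "b" * 60)
print(f"estimated total duration: {total:.1f} s")
```

The caveat from the comment above shows up directly in the arithmetic: with a very short prompt, `seconds_per_char` is computed from only a few characters, so a small timing error in the prompt is multiplied across the whole generated text. The same happens when the prompt is spoken unusually fast or slowly.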