SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Short sentences question, and dataset question. #433

Closed 0x1001u closed 2 weeks ago

0x1001u commented 2 weeks ago

Question details

First question: does a sampling rate of 16000 Hz have any impact on the audio, or does it have to be 24000 Hz? Second question: when generating sentences of only one or two words, the output is not generated correctly, for example "Hello, thank you." How can I adjust this?

SWivid commented 2 weeks ago

The current base model is trained on 24 kHz audio, so it needs 24 kHz input. For short-sentence generation, the simple linear duration estimate is not accurate, so consider setting a fixed duration instead (reference audio plus the audio to generate, in seconds): https://github.com/SWivid/F5-TTS/blob/e78ae2ce92ff1f2357af70e08b44fb09981ce1ce/src/f5_tts/infer/utils_infer.py#L51
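To make the arithmetic concrete, here is a minimal sketch of why a linear estimate can come out too short for one- or two-word prompts, and how a fixed total duration (reference + to generate) is computed instead. The function names, the use of UTF-8 byte length as the text-length measure, and the example numbers are all assumptions for illustration, not code taken from the repository.

```python
def linear_duration_estimate(ref_seconds, ref_text, gen_text, speed=1.0):
    """Hypothetical helper: estimate seconds of generated speech by scaling
    the reference duration with the ratio of text lengths.
    Assumption: text length is measured in UTF-8 bytes."""
    ref_len = len(ref_text.encode("utf-8"))
    gen_len = len(gen_text.encode("utf-8"))
    return ref_seconds / ref_len * gen_len / speed


def fixed_total_duration(ref_seconds, gen_seconds):
    """Hypothetical helper: total fixed duration in seconds, i.e. the
    reference audio plus the time reserved for the generated part."""
    return ref_seconds + gen_seconds


if __name__ == "__main__":
    ref_seconds = 6.0            # made-up reference clip length
    ref_text = "a" * 120         # stand-in for a 120-byte reference transcript
    gen_text = "Hello, thank you."

    # The linear estimate leaves well under a second for the short phrase.
    estimate = linear_duration_estimate(ref_seconds, ref_text, gen_text)
    print(f"linear estimate for generated part: {estimate:.2f} s")

    # Reserving 2 s for the generated part gives the fixed total to use.
    total = fixed_total_duration(ref_seconds, 2.0)
    print(f"fixed total duration: {total:.1f} s")
```

With a fast-spoken reference, the per-character rate is small, so a short phrase gets very little time. Choosing the generated length by hand and adding it to the reference length avoids that.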