0x1001u closed 2 weeks ago
The current base model is trained on 24 kHz audio, so it needs 24 kHz input. For short sentences, the simple linear duration estimate is not accurate, so consider setting a fixed duration (reference + generated audio combined, in seconds): https://github.com/SWivid/F5-TTS/blob/e78ae2ce92ff1f2357af70e08b44fb09981ce1ce/src/f5_tts/infer/utils_infer.py#L51
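To illustrate why a fixed duration can help with very short text, here is a rough sketch of the idea, not the actual F5-TTS code. It assumes the linear estimate scales the reference clip's seconds-per-character rate by the length of the text to generate; the function names and the `fix_duration` and `speed` parameters are hypothetical and only follow the wording of the reply above.

```python
from typing import Optional


def linear_estimate_seconds(ref_seconds: float, ref_text: str,
                            gen_text: str, speed: float = 1.0) -> float:
    """Total duration (reference + generated), assuming the generated part
    is scaled linearly from the reference's seconds-per-character rate.

    This is an illustrative assumption, not the library's exact formula.
    """
    ref_len = max(len(ref_text), 1)  # avoid division by zero
    gen_len = len(gen_text)
    return ref_seconds + ref_seconds / ref_len * gen_len / speed


def total_duration_seconds(ref_seconds: float, ref_text: str, gen_text: str,
                           fix_duration: Optional[float] = None,
                           speed: float = 1.0) -> float:
    """Use the fixed total duration if given, else the linear estimate."""
    if fix_duration is not None:
        if fix_duration <= ref_seconds:
            raise ValueError("fix_duration must exceed the reference length")
        return fix_duration
    return linear_estimate_seconds(ref_seconds, ref_text, gen_text, speed)


if __name__ == "__main__":
    ref_text = "x" * 60            # stand-in for a 60-character transcript
    gen_text = "Hello, thank you."
    # Linear estimate: the short text gets only a small slice of time.
    print(total_duration_seconds(6.0, ref_text, gen_text))
    # Fixed duration: reference (6 s) plus 2 s reserved for the new speech.
    print(total_duration_seconds(6.0, ref_text, gen_text, fix_duration=8.0))
```

With a fast-spoken reference, the linear estimate can leave too little time for one or two words, which is when setting the total explicitly is worth trying. The value to use is something to tune by ear.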
Question details
First question: does a sampling rate of 16000 Hz have any impact on the audio, or does it have to be 24000 Hz? Second question: when generating sentences of only one or two words, for example "Hello, thank you.", the output is not generated correctly. How can I adjust this?