Closed kaniosm closed 2 weeks ago
Hi @kaniosm, as a temporary workaround you could use https://github.com/SWivid/F5-TTS/blob/8a7e8495fff609cd8f4085c9efe8f2964995fc12/src/f5_tts/infer/utils_infer.py#L51 (the total seconds of ref_audio plus the audio to generate) and manually split long text to get finer control.
Currently it does chunked generation with a simple linear estimate of sentence-level duration; this should improve with the next generation of the base model.
Will close this issue; feel free to reopen if you have further questions.
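A rough sketch of the workaround described above, for anyone landing here: split the text by hand into short chunks and compute a total duration (reference audio plus generated audio) for each chunk. This is not F5-TTS code. `split_text`, `total_seconds`, `max_chars`, and `slowdown` are hypothetical names, and the linear per-byte estimate is an assumption based on the "simple linear estimate" mentioned in the reply. How the resulting value is passed to the library is left out, since that depends on the setting linked above.

```python
import re


def split_text(text, max_chars=80):
    """Pack sentences into chunks, aiming for at most max_chars per chunk.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def total_seconds(ref_seconds, ref_text, gen_text, slowdown=1.0):
    """Linear estimate of total duration: reference audio plus generated audio.

    Assumes speaking time scales with UTF-8 byte length of the text.
    slowdown > 1.0 allots more time, e.g. for spelled-out names or digits.
    """
    ref_bytes = max(len(ref_text.encode("utf-8")), 1)
    gen_bytes = len(gen_text.encode("utf-8"))
    seconds_per_byte = ref_seconds / ref_bytes
    return ref_seconds + seconds_per_byte * gen_bytes * slowdown


if __name__ == "__main__":
    ref_text = "This is the transcript of the reference clip."
    ref_seconds = 4.0  # length of the reference audio, measured by the user
    text = "My name is spelled K, A, N. The code is 4 7 1 9. Thank you."
    for chunk in split_text(text, max_chars=30):
        print(f"{total_seconds(ref_seconds, ref_text, chunk, slowdown=1.5):.2f}s  {chunk}")
```

Putting the spelled-out name or the digits in their own chunk and raising `slowdown` for that chunk gives the model more time per character, which may help with the "spelling is too fast" symptom.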
Checks
Environment Details
WSL with Python 3.12 running on RTX 4070.
Steps to Reproduce
✔️ Expected Behavior
The numbers should be read as written, and the name should be spelled out.
❌ Actual Behavior
Not all numbers are spoken. Not all letters of the spelled-out name are spoken. The spelling is too fast. Sometimes the name is mispronounced. The generated audio is inconsistent and the output varies a lot between runs.