Generate voice with spelling and numbers

kaniosm commented 2 weeks ago

Checks

[X] This template is only for usage issues encountered.
[X] I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
[X] I have searched for existing issues, including closed ones, and couldn't find a solution.
[X] I confirm that I am using English to submit this report in order to facilitate communication.

Environment Details

WSL with Python 3.12 running on RTX 4070.

Steps to Reproduce

Clone F5-TTS to a new dir
Create new venv with python (python3 -m venv .venv)
Install F5 (pip install -e .)
Generate the audio with the command below: f5-tts_infer-cli --model "F5-TTS" --ref_audio "untitled1.wav" --ref_text "One last question. When I was coming in, I saw students holding signs like lollipops with the words STOP on them. What are these for?" --gen_text "Hello! My name is Maria, an AI assistant. I’m calling to make a reservation on behalf of a client , could you assist me with that? My name is spelled like M. A. R. I. A. and my phone number is 99 44 09 12."

✔️ Expected Behavior

The numbers should be spoken as written, and the name should be spelled out letter by letter.

❌ Actual Behavior

Not all numbers are spoken. Not all letters of the spelled name are spoken. The spelling is too fast. Sometimes the name is mispronounced. The generated audio is inconsistent, and the output varies a lot between runs.

SWivid commented 2 weeks ago

Hi @kaniosm, as a temporary workaround you could use the setting at https://github.com/SWivid/F5-TTS/blob/8a7e8495fff609cd8f4085c9efe8f2964995fc12/src/f5_tts/infer/utils_infer.py#L51 (the total duration in seconds of the ref_audio plus the audio to generate) and manually split long text to get finer control.
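A minimal sketch of what that manual split could look like for the text in this issue. It only plans the chunks and the total duration to request for each one; it does not call F5-TTS. The `Segment` class, the `total_duration` helper, the 8-second reference length, and the per-chunk speaking times are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str           # one hand-split piece of gen_text
    gen_seconds: float  # hand-picked speaking time for this piece (assumption)


def total_duration(ref_seconds: float, segment: Segment) -> float:
    """Total seconds to request: reference audio plus the audio to generate."""
    return ref_seconds + segment.gen_seconds


REF_SECONDS = 8.0  # hypothetical length of untitled1.wav

# Give the spelling and the phone number their own chunks with generous
# time budgets, so the letters and digits are not rushed.
segments = [
    Segment("Hello! My name is Maria, an AI assistant.", 3.0),
    Segment("I'm calling to make a reservation on behalf of a client, "
            "could you assist me with that?", 5.0),
    Segment("My name is spelled like M. A. R. I. A.", 6.0),
    Segment("And my phone number is 99 44 09 12.", 6.0),
]

for seg in segments:
    print(f"{total_duration(REF_SECONDS, seg):5.1f}s  {seg.text}")
```

Each chunk would then be generated on its own with its total duration, and the resulting clips joined afterwards.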

Currently, inference does chunked generation with a simple linear estimate of sentence-level duration; this should improve with the next generation of the base model.
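To illustrate why a linear estimate can rush spelled-out letters and digits, here is a rough sketch under stated assumptions: the generated duration is scaled from the reference audio's seconds per character. The function name `estimate_total_seconds`, the use of UTF-8 byte length as the text length, and the 8-second reference are assumptions for illustration, not necessarily what the project does.

```python
def estimate_total_seconds(ref_seconds: float, ref_text: str,
                           gen_text: str, speed: float = 1.0) -> float:
    """Linear estimate: reference duration plus generated duration,
    where the generated part is scaled by text length (assumed model)."""
    ref_len = len(ref_text.encode("utf-8"))
    gen_len = len(gen_text.encode("utf-8"))
    return ref_seconds + ref_seconds / ref_len * gen_len / speed


ref_text = ("One last question. When I was coming in, I saw students holding "
            "signs like lollipops with the words STOP on them. "
            "What are these for?")
ref_seconds = 8.0  # hypothetical length of the reference clip

# The spelled name is only a handful of characters, so a length-based
# estimate budgets very little time for five separately spoken letters.
spelled = "M. A. R. I. A."
budget = estimate_total_seconds(ref_seconds, ref_text, spelled) - ref_seconds
print(f"{len(spelled)} characters -> about {budget:.2f}s of speech budgeted")
```

Under this model, short strings that take a long time to say (letters, digit pairs) get too little time, which is consistent with the skipped letters and numbers reported above.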

SWivid commented 2 weeks ago

Will close this issue; feel free to reopen if you have further questions.
