SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Generate voice with spelling and numbers #409

Closed kaniosm closed 2 weeks ago

kaniosm commented 2 weeks ago

Environment Details

WSL with Python 3.12 running on RTX 4070.

Steps to Reproduce

  1. Clone F5-TTS to a new dir
  2. Create new venv with python (python3 -m venv .venv)
  3. Install F5 (pip install -e .)
  4. Generate the below audio: f5-tts_infer-cli --model "F5-TTS" --ref_audio "untitled1.wav" --ref_text "One last question. When I was coming in, I saw students holding signs like lollipops with the words STOP on them. What are these for?" --gen_text "Hello! My name is Maria, an AI assistant. I’m calling to make a reservation on behalf of a client , could you assist me with that? My name is spelled like M. A. R. I. A. and my phone number is 99 44 09 12."

✔️ Expected Behavior

The numbers should be spoken as written, and the name should be spelled out letter by letter.

❌ Actual Behavior

Not all numbers are spoken. Not all letters of the spelled name are spoken. The spelling is too fast. Sometimes the name is mispronounced. The generated audio is inconsistent, and the output varies a lot.

SWivid commented 2 weeks ago

Hi @kaniosm, as a temporary workaround you could use https://github.com/SWivid/F5-TTS/blob/8a7e8495fff609cd8f4085c9efe8f2964995fc12/src/f5_tts/infer/utils_infer.py#L51 (the total duration in seconds of the ref_audio plus the audio to generate), and manually split long text, to get finer control.
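
A minimal sketch of the manual split, assuming one CLI run per hand-picked chunk. It uses only the flags that appear in the reproduction command above. The chunk boundaries are an illustrative choice, and the helper name build_command is hypothetical. How each run's output is named or stored is not shown in this thread, so it is left out here.

```python
import shlex

REF_AUDIO = "untitled1.wav"
REF_TEXT = (
    "One last question. When I was coming in, I saw students holding signs "
    "like lollipops with the words STOP on them. What are these for?"
)

# Hand-picked chunks of the original gen_text (boundaries are an assumption).
# The spelled name and the phone number each get their own run.
CHUNKS = [
    "Hello! My name is Maria, an AI assistant.",
    "I’m calling to make a reservation on behalf of a client , could you assist me with that?",
    "My name is spelled like M. A. R. I. A.",
    "and my phone number is 99 44 09 12.",
]


def build_command(gen_text: str) -> list[str]:
    """Return the argv list for one f5-tts_infer-cli run (hypothetical helper)."""
    return [
        "f5-tts_infer-cli",
        "--model", "F5-TTS",
        "--ref_audio", REF_AUDIO,
        "--ref_text", REF_TEXT,
        "--gen_text", gen_text,
    ]


commands = [build_command(chunk) for chunk in CHUNKS]

# Print shell-quoted commands so they can be reviewed before running them.
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argv lists can be passed to subprocess.run one at a time. Keeping the spelled name and the digits in separate, short chunks is the part that gives per-segment control.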

Currently it does chunk generation with a simple linear estimate of sentence-level duration; this should improve with the next generation of the base model.
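
To illustrate why a linear estimate can rush digits and spelled-out letters, here is a rough sketch. It assumes the time allotted to the generated text is the reference speaking rate (seconds per UTF-8 byte of ref_text) multiplied by the byte length of the text to generate. That formula, the function name estimate_gen_seconds, and the 8-second reference length are assumptions for illustration, not something confirmed in this thread.

```python
REF_TEXT = (
    "One last question. When I was coming in, I saw students holding signs "
    "like lollipops with the words STOP on them. What are these for?"
)
REF_SECONDS = 8.0  # hypothetical length of untitled1.wav


def estimate_gen_seconds(ref_seconds: float, ref_text: str, gen_text: str,
                         speed: float = 1.0) -> float:
    """Linear estimate: allotted time grows with the byte length of gen_text."""
    ref_len = len(ref_text.encode("utf-8"))
    gen_len = len(gen_text.encode("utf-8"))
    return ref_seconds / ref_len * gen_len / speed


digits = "99 44 09 12"
spoken = "ninety-nine forty-four oh nine twelve"  # one possible reading

t_digits = estimate_gen_seconds(REF_SECONDS, REF_TEXT, digits)
t_spoken = estimate_gen_seconds(REF_SECONDS, REF_TEXT, spoken)

# The digit string is allotted far less time than its spoken form would get.
print(f"digits: {t_digits:.2f}s, spoken form: {t_spoken:.2f}s")
```

Under this assumption, text whose written form is much shorter than its spoken form gets too little time. That would be consistent with the skipped numbers and the rushed spelling reported above.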

SWivid commented 2 weeks ago

Closing this issue; feel free to reopen it if you have further questions.