SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Getting audio from ref-audio during cross-language audio generation #505


Mu-iq commented 1 day ago


Environment Details

Windows 11, Python 3.10.11, torch 2.3.0+cu118, gradio 4.44.1, GPU with 24 GB VRAM

Steps to Reproduce

  1. python -m venv voice_clone_venv
  2. voice_clone_venv\Scripts\activate
  3. git clone https://github.com/SWivid/F5-TTS
  4. cd F5-TTS
  5. pip install -e .
  6. python src\f5_tts\train\datasets\prepare_csv_wavs.py "path to input dir" "path to output dir"
  7. python src\f5_tts\train\finetune_cli.py --dataset_name arabic_finetune

This is the format of the metadata.csv I have created (see the attached screenshot, not reproduced here).
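Since the screenshot doesn't survive in text form, here is a minimal sketch of the pipe-separated layout prepare_csv_wavs.py expects; the filenames and transcripts are made-up placeholders, and the audio_file|text header and wavs/ sub-folder convention follow the repo's fine-tuning docs, so double-check against those:

```
audio_file|text
wavs/segment_0001.wav|<Arabic transcript of segment 1>
wavs/segment_0002.wav|<Arabic transcript of segment 2>
```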

✔️ Expected Behavior

I have fine-tuned the base model for Arabic using approximately 13 hours of sample data over 100 epochs, with all other settings left at their defaults. However, I'm encountering an issue where the generated audio includes fragments of the reference audio. Specifically, when the reference audio is in Arabic and the text for generation is in English, the generated audio randomly includes irrelevant "garbage" content.

This problem doesn't occur when both the reference audio and the generated text are in Arabic; everything works fine in that case. Notably, I haven't modified the vocabulary file during fine-tuning. My goal is to use Arabic reference audio to generate clear English audio, but the model fails to do so and introduces this unintended content.

Even when I continue fine-tuning the model, the issue persists, and the pronunciation of English words in the generated audio becomes progressively worse. What could be causing this behavior?

❌ Actual Behavior

No response

SWivid commented 1 day ago

Try fixing the duration when the reference and generated text are in different languages: https://github.com/SWivid/F5-TTS/blob/ab2ad3b005ea839ab698493a819bde909761d96e/src/f5_tts/infer/utils_infer.py#L53 For example, if you use a 10 s ref_audio and want to generate a 12 s audio, set the fixed duration to 22.
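To make the arithmetic concrete, here is a minimal sketch of computing that value. The helper name total_fix_duration is made up; per the example above, the fixed duration at the linked line is the total length in seconds, reference plus generated:

```python
import torchaudio

def total_fix_duration(ref_audio_path: str, target_gen_seconds: float) -> float:
    """Return reference-audio length plus desired generated length, in seconds."""
    waveform, sample_rate = torchaudio.load(ref_audio_path)
    ref_seconds = waveform.shape[-1] / sample_rate
    # e.g. a 10 s reference plus a 12 s target generation -> 22
    return ref_seconds + target_gen_seconds
```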

The issue arises because we simply estimate the duration from the number of characters, so if the speaking rate differs greatly between the two languages, the estimate will be off.
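A rough sketch of that character-count estimate, simplified from the repo's approach (the frame and byte-counting details may differ from the actual code):

```python
def estimate_total_frames(ref_audio_frames: int, ref_text: str,
                          gen_text: str, speed: float = 1.0) -> int:
    """Estimate total mel frames: reference frames plus generated frames
    scaled by the ratio of generated-text bytes to reference-text bytes."""
    ref_len = max(len(ref_text.encode("utf-8")), 1)
    gen_len = len(gen_text.encode("utf-8"))
    # Assumes the generated speech has the same seconds-per-character
    # rate as the reference audio.
    return ref_audio_frames + int(ref_audio_frames * gen_len / ref_len / speed)
```

That same-rate assumption is exactly what breaks when the reference and generated text are in different languages.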

There are a few alternatives to fixing a duration:

  1. Train a duration predictor for the specific language.
  2. Apply a bias to the simple estimate: e.g., if the English speaking rate per character is a and the Arabic rate is b, apply a factor of a/b at https://github.com/SWivid/F5-TTS/blob/ab2ad3b005ea839ab698493a819bde909761d96e/src/f5_tts/infer/utils_infer.py#L443, and vice versa (see the sketch after this list).
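A hedged sketch of option 2, reading a and b as average seconds per character so that the a/b factor above converts a reference-rate estimate into the target language's rate; the numeric rates below are invented placeholders, so measure them from your own data:

```python
# Hypothetical average seconds-per-character rates; measure from real data.
SEC_PER_CHAR = {"en": 0.07, "ar": 0.09}

def biased_gen_frames(est_gen_frames: int, ref_lang: str, gen_lang: str) -> int:
    """Rescale the generated-part frame estimate when the reference and
    generation languages differ.

    The raw estimate implicitly uses the reference language's rate, so
    multiply by (gen-language rate / ref-language rate) to convert it.
    """
    factor = SEC_PER_CHAR[gen_lang] / SEC_PER_CHAR[ref_lang]
    return int(est_gen_frames * factor)
```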