SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Getting audio from ref-audio during cross-language audio generation #505


Mu-iq commented 1 day ago


Environment Details

Windows 11, Python 3.10.11, torch 2.3.0+cu118, gradio 4.44.1, GPU with 24 GB VRAM

Steps to Reproduce

  1. python -m venv voice_clone_venv
  2. voice_clone_venv\Scripts\activate
  3. git clone https://github.com/SWivid/F5-TTS
  4. cd F5-TTS
  5. pip install -e .
  6. python src\f5_tts\train\datasets\prepare_csv_wavs.py "path to input dir" "path to output dir"
  7. python src\f5_tts\train\finetune_cli.py --dataset_name arabic_finetune

This is the format of the metadata.csv I have created (see the attached screenshot, not reproduced here).
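Since the screenshot doesn't survive in text form, here is a minimal sketch of the pipe-separated layout prepare_csv_wavs.py expects; the filenames and transcripts are made-up placeholders, and the audio_file|text header and wavs/ sub-folder convention follow the repo's fine-tuning docs, so double-check against those:

```
audio_file|text
wavs/segment_0001.wav|<Arabic transcript of segment 1>
wavs/segment_0002.wav|<Arabic transcript of segment 2>
```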

✔️ Expected Behavior

I have fine-tuned the base model for Arabic using approximately 13 hours of sample data over 100 epochs, with all other settings left at their defaults. However, I'm encountering an issue where the generated audio includes fragments of the reference audio. Specifically, when the reference audio is in Arabic and the text for generation is in English, the generated audio randomly includes irrelevant "garbage" content.

This problem doesn't occur when both the reference audio and the generated text are in Arabic; everything works fine in that case. Notably, I haven't modified the vocabulary file during fine-tuning. My goal is to use Arabic reference audio to generate clear English audio, but the model fails to do so and introduces this unintended content.

Even when I continue fine-tuning the model, the issue persists, and the pronunciation of English words in the generated audio becomes progressively worse. What could be causing this behavior?

❌ Actual Behavior

No response

SWivid commented 1 day ago

Try fixing the duration when the reference and generated text are in different languages: https://github.com/SWivid/F5-TTS/blob/ab2ad3b005ea839ab698493a819bde909761d96e/src/f5_tts/infer/utils_infer.py#L53 For example, if you use a 10 s ref_audio and want to generate a 12 s audio, set the fixed duration to 22.
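To make the arithmetic concrete, here is a minimal sketch of computing that value. The helper name total_fix_duration is made up; per the example above, the fixed duration at the linked line is the total length in seconds, reference plus generated:

```python
import torchaudio

def total_fix_duration(ref_audio_path: str, target_gen_seconds: float) -> float:
    """Return reference-audio length plus desired generated length, in seconds."""
    waveform, sample_rate = torchaudio.load(ref_audio_path)
    ref_seconds = waveform.shape[-1] / sample_rate
    # e.g. a 10 s reference plus a 12 s target generation -> 22
    return ref_seconds + target_gen_seconds
```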

The issue arises because we simply estimate the duration from the number of characters, so if the speaking rate differs greatly between the two languages, the estimate will be off.
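A rough sketch of that character-count estimate, simplified from the repo's approach (the frame and byte-counting details may differ from the actual code):

```python
def estimate_total_frames(ref_audio_frames: int, ref_text: str,
                          gen_text: str, speed: float = 1.0) -> int:
    """Estimate total mel frames: reference frames plus generated frames
    scaled by the ratio of generated-text bytes to reference-text bytes."""
    ref_len = max(len(ref_text.encode("utf-8")), 1)
    gen_len = len(gen_text.encode("utf-8"))
    # Assumes the generated speech has the same seconds-per-character
    # rate as the reference audio.
    return ref_audio_frames + int(ref_audio_frames * gen_len / ref_len / speed)
```

That same-rate assumption is exactly what breaks when the reference and generated text are in different languages.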

There are a few alternatives to fixing a duration:

  1. Train a duration predictor for the specific language.
  2. Apply a bias to the simple estimate: e.g., if the English speaking rate per character is a and the Arabic rate is b, apply a factor of a/b at https://github.com/SWivid/F5-TTS/blob/ab2ad3b005ea839ab698493a819bde909761d96e/src/f5_tts/infer/utils_infer.py#L443, and vice versa (see the sketch after this list).
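A hedged sketch of option 2, reading a and b as average seconds per character so that the a/b factor above converts a reference-rate estimate into the target language's rate; the numeric rates below are invented placeholders, so measure them from your own data:

```python
# Hypothetical average seconds-per-character rates; measure from real data.
SEC_PER_CHAR = {"en": 0.07, "ar": 0.09}

def biased_gen_frames(est_gen_frames: int, ref_lang: str, gen_lang: str) -> int:
    """Rescale the generated-part frame estimate when the reference and
    generation languages differ.

    The raw estimate implicitly uses the reference language's rate, so
    multiply by (gen-language rate / ref-language rate) to convert it.
    """
    factor = SEC_PER_CHAR[gen_lang] / SEC_PER_CHAR[ref_lang]
    return int(est_gen_frames * factor)
```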