Open · Mu-iq opened this issue 1 day ago
Try fixing the duration when the reference and the text to generate are in different languages: https://github.com/SWivid/F5-TTS/blob/ab2ad3b005ea839ab698493a819bde909761d96e/src/f5_tts/infer/utils_infer.py#L53. For example, if your ref_audio is 10 s and you want to generate 12 s of audio, set it to 22 (the total duration of reference plus generated audio).
The issue arises because we simply estimate the duration from the number of characters, so if the speaking rate differs greatly between the two languages, there will be a problem.
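A minimal sketch of the workaround, assuming `model` and `vocoder` have already been loaded with the repo's helpers, and that `fix_duration` (in seconds) is the total length of reference plus generated audio, as in the linked code:

```python
# Sketch only: file names are hypothetical, model/vocoder assumed loaded.
from f5_tts.infer.utils_infer import infer_process

wave, sample_rate, _ = infer_process(
    "ref_arabic_10s.wav",       # hypothetical 10 s Arabic reference clip
    "...",                      # transcript of the reference audio
    "Hello, this is a test.",   # English text to synthesize (~12 s target)
    model,
    vocoder,
    fix_duration=22,            # 10 s (ref) + 12 s (target) = 22 s total
)
```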
There are a few alternatives to fixing a duration: for example, estimate the speaking rate (in characters per second) for English as a and for Arabic as b, then scale the character-based duration estimate by the factor a/b.
For https://github.com/SWivid/F5-TTS/blob/ab2ad3b005ea839ab698493a819bde909761d96e/src/f5_tts/infer/utils_infer.py#L443, apply the factor the other way around (vice versa); a sketch of the idea follows below.
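A sketch of the speaking-rate idea above; none of these names are part of the F5-TTS API, and the rates a and b would be measured once from representative clips in each language:

```python
# Sketch only: hypothetical helpers, not F5-TTS code.

def chars_per_second(transcript: str, audio_seconds: float) -> float:
    """Empirical speaking rate of a known clip, in characters per second."""
    return len(transcript) / audio_seconds

# Measured from representative clips (durations here are made up).
rate_en = chars_per_second("Hello, this is a short test sentence.", 2.6)  # a
rate_ar = chars_per_second("مرحبا، هذه جملة قصيرة للاختبار.", 2.9)        # b

def estimate_gen_seconds(ref_text: str, ref_seconds: float, gen_text: str,
                         ref_rate: float, gen_rate: float) -> float:
    # The naive estimate assumes the generated speech shares the reference's
    # speaking rate; multiplying by ref_rate / gen_rate corrects for the
    # cross-language difference (the a/b factor mentioned above).
    naive = ref_seconds * len(gen_text) / len(ref_text)
    return naive * ref_rate / gen_rate

# Arabic reference, English generation: total duration for fix_duration.
total = 10.0 + estimate_gen_seconds("النص المرجعي هنا", 10.0,
                                    "Hello, this is a test.",
                                    ref_rate=rate_ar, gen_rate=rate_en)
```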
Checks
Environment Details
Windows 11, python==3.10.11, torch==2.3.0+cu118, gradio==4.44.1, GPU with 24 GB VRAM
Steps to Reproduce
This is the type of metadata.csv I have created:
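(The original attachment is not shown here. For reference, a hypothetical layout assuming the pipe-delimited, LJSpeech-style `audio_file|text` format that the repo's CSV dataset preparation script expects; check prepare_csv_wavs.py for the exact header:)

```
audio_file|text
wavs/arabic_0001.wav|النص العربي للمقطع الأول
wavs/arabic_0002.wav|النص العربي للمقطع الثاني
```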
✔️ Expected Behavior
I have fine-tuned the base model for Arabic using approximately 13 hours of sample data for 100 epochs, with the other settings left at their defaults. However, I'm encountering an issue where the generated audio includes fragments of the reference audio. Specifically, when the reference audio is in Arabic and the text for generation is in English, the generated audio randomly includes irrelevant "garbage" content.
This problem doesn't occur when both the reference audio and the generated text are in Arabic—everything works fine in that case. Notably, I haven't modified the vocabulary file during fine-tuning. My goal is to use Arabic reference audio to generate clear English audio, but the model fails to do so and introduces this unintended content.
Even when I continue fine-tuning the model, the issue persists, and the pronunciation of English words in the generated audio becomes progressively worse. What could be causing this behavior?
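Since the vocabulary file was left untouched, one quick diagnostic (a sketch with a hypothetical path, assuming vocab.txt ships as one token per line as in the released checkpoints) is to check whether every character of the English text actually exists in the model's vocabulary; characters absent from it can surface as garbled audio:

```python
# Sketch only: the vocab path below is hypothetical.
def check_vocab_coverage(vocab_path: str, gen_text: str) -> set[str]:
    with open(vocab_path, encoding="utf-8") as f:
        vocab = {line.rstrip("\n") for line in f}
    # Return the characters the tokenizer cannot represent.
    return {ch for ch in gen_text if ch not in vocab}

missing = check_vocab_coverage("data/your_dataset/vocab.txt", "Hello, world!")
if missing:
    print("Characters missing from vocab:", sorted(missing))
```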
❌ Actual Behavior
No response