SWivid / F5-TTS

Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
https://arxiv.org/abs/2410.06885
MIT License

Sometimes generates output with a phrase from reference audio. #85

Open chigkim opened 1 day ago

chigkim commented 1 day ago

The ref_text includes the phrase 'our love of chocolate,' but the gen_text doesn't. However, sometimes the phrase gets added to the end of a generation. When using inference-cli with long texts, it might repeat the phrase two or three times throughout. It doesn't replace anything from the gen_text; it just randomly adds it to the end of some generations. Any idea what might be causing this?

SWivid commented 1 day ago

Mainly due to the inaccurate linearly estimated duration. Since it is leaking ref_text, consider just removing some punctuation in ref_text.


jpgallegoar commented 1 day ago

Try to always add a '.' at the end of the reference text and begin the prompt with a ' '. Also make sure the reference audio is properly trimmed (it should start and end in silence, not during a sound), since that can also cause issues.
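
For anyone trying these suggestions, here is a minimal preprocessing sketch that applies both pieces of advice from this thread (end the reference transcript with a period, start the prompt with a space) before running inference. The helper name and example strings are made up; this is not code from the repository.

```python
def prepare_texts(ref_text: str, gen_text: str) -> tuple[str, str]:
    """Hypothetical helper: clean up ref_text/gen_text as suggested in this thread."""
    ref_text = ref_text.strip()
    # End the reference transcript with sentence-final punctuation.
    if not ref_text.endswith((".", "!", "?")):
        ref_text += "."
    # Begin the text to generate with a space so it is not glued onto ref_text.
    if not gen_text.startswith(" "):
        gen_text = " " + gen_text
    return ref_text, gen_text


# Example usage with made-up strings:
ref_text, gen_text = prepare_texts(
    "Some people may not realize our love of chocolate",
    "This is the sentence I actually want synthesized.",
)
print(repr(ref_text), repr(gen_text))
```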

chigkim commented 1 day ago

The ref_audio is recorded and edited well. There's no weird noise, no cut-off phrases, etc. @SWivid: "consider just removing some punctuation in ref_text." @jpgallegoar: "Try to always add a '.' at the end of the reference text and begin the prompt with a ' '." If it's just editing punctuation, should we make the script do this automatically?

jpgallegoar commented 1 day ago


It's something I proposed, but I'm not confident enough in that theory to create a PR and force it upon every generation. I also thought about passing the reference text and prompt to a small LLM with few-shot examples on how to write everything accurately, to improve quality. This is still very early, so we need to keep testing to find the best practices.

chigkim commented 1 day ago

Since batch processing automatically splits long texts into smaller chunks, I think punctuation should be adjusted automatically, because users can't predict how the text will be split. That said, I experimented with removing punctuation from the reference text and adding a space at the beginning and a period at the end of the generated text. Unfortunately, it didn't seem to make much difference. I experimented with various reference audios, and here are a couple of observations. I'm not sure what it means, but the model seems to consistently select the last phrase from the reference audio (never one from the middle) and randomly insert it. The E2 model seems to have this problem more than F5.

SWivid commented 1 day ago

@chigkim You are right. It is an inherent problem in models that have no frame-level phoneme alignment: they are very sensitive to the given duration.

It is the last phrase that leaks because the model synthesizes e.g. abcdefg | 12345678, where the letters are the ref_text and the numbers are the gen_text; after synthesis, the portion matching the length of ref_audio is cut off and discarded. So if the model actually produces abcdef | g12345678, with abcdef already spanning the length of ref_audio, 'g' leaks into the output.

An accurate duration estimate helps avoid such cases, which is why the initial script suggests using a fixed duration (to control it precisely, though that is not very convenient in actual usage scenarios). There will definitely be a solution for this, given the big progress we have seen in TTS models in recent years.
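
To make the cut-and-discard step concrete, here is a toy numeric sketch of the linear duration estimate and the reference-length trim described above. The frame counts and variable names are illustrative assumptions, not the repository's actual inference code.

```python
# Toy illustration of the duration/leak mechanism described above.
ref_text = "abcdefg"     # transcript of the reference audio
gen_text = "12345678"    # text we want generated
ref_audio_frames = 400   # how long the reference audio actually is (made-up number)

# Linear duration estimate: assume characters map to frames at the same
# rate as in the reference audio.
frames_per_char = ref_audio_frames / len(ref_text)
total_frames = int(ref_audio_frames + frames_per_char * len(gen_text))

# The model fills total_frames with speech for ref_text + gen_text, then the
# first ref_audio_frames are cut off and the rest is returned as output.
# If the model speaks the reference transcript slightly faster than estimated
# (e.g. finishes "abcdef" by frame 400 and places "g" after it), the cut at
# frame 400 keeps "g" in the output -- the leaked phrase reported in this issue.
cut_point = ref_audio_frames
print(f"synthesize {total_frames} frames, keep frames {cut_point}..{total_frames}")
```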