coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] XTTS v2 - short utterances finetune doesn't work #3964

Open SinanAkkoyun opened 3 months ago

SinanAkkoyun commented 3 months ago

Describe the bug

I cannot get short utterances (a couple of words) to work without hallucinations at the end, despite my training mix being 50/50 very short and long utterances. Why won't the GPT predict the EOT token correctly if it has already seen enough examples? (1h of training data at epoch 46)

Is it due to some batched training optimization that neglects EOT tokens?
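One thing worth checking in the training code is whether padded positions in the target token sequences are masked out of the loss: if they are not, the model is effectively trained to emit padding instead of the stop token after short utterances. Below is a minimal, hypothetical sketch (plain Python, made-up token ids, not XTTS's actual vocabulary or data pipeline) of what correct masking looks like when targets are later fed to a cross-entropy loss with `ignore_index`:

```python
# Hypothetical token ids, for illustration only.
PAD, EOT, IGNORE = 0, 1, -100  # IGNORE matches torch.nn.CrossEntropyLoss(ignore_index=-100)

def build_targets(sequences, max_len):
    """Append EOT to each sequence, then pad to max_len with IGNORE so the
    loss skips padded positions instead of training the model to predict PAD."""
    targets = []
    for seq in sequences:
        seq = seq + [EOT]                               # model must learn to stop here
        padded = seq + [IGNORE] * (max_len - len(seq))  # masked, never scored
        targets.append(padded)
    return targets

batch = build_targets([[5, 6], [5, 6, 7, 8]], max_len=6)
# short sequence: [5, 6, EOT, IGNORE, IGNORE, IGNORE]
```

If the short sequences in a batch instead carried real PAD tokens in the target (unmasked), the EOT prediction would be diluted exactly the way this issue describes.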

To Reproduce

Finetune the model on a 50/50 mix of very short and long utterances (the issue also sometimes appears with the pretrained XTTS v2 checkpoint and a custom speaker latent), then prompt with "Program complete." or something similar.

Expected behavior

It should cut off after generating the sentence.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 4090",
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.4.0+cu121",
        "TTS": "0.24.1",
        "numpy": "1.26.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.11.5",
        "version": "#128-Ubuntu SMP Fri Jul 5 09:28:59 UTC 2024"
    }
}

Additional context

No response

SinanAkkoyun commented 3 months ago

I am at epoch 241 and it just gets worse. It hallucinates even after a seven-word sentence. There must be something wrong with the batched padding or similar; I'd appreciate help.

duringleaves commented 3 months ago

I get hallucinations and slurred words at the beginning and end of short phrases. I've trained on a mix of short (fewer than 7 words) and long phrases, but it just doesn't like it.

SinanAkkoyun commented 3 months ago

@duringleaves you get slurs? xD

duringleaves commented 3 months ago

Hahaha. Well, the SLURRED words and "oy NUk!" breakouts calmed down with a much lower temperature. Still not perfect, but more usable than I feared.


JohannPie commented 3 weeks ago

Same problem here. The length penalty also doesn't seem to work for me.

duringleaves commented 3 weeks ago

Just to follow up: I've pretty much resolved this. I was using the xtts-finetune-webui project, which automates dataset creation. There are a lot of problems with that, though not really the author's fault.

Whisper is used for transcribing and timestamping the phrases, which is good because it's accurate, but bad because it takes a lot of liberty with the results. Someone saying "twelve fifty" might get transcribed as "twelve point five" or "one thousand two hundred and fifty". Whisper's timestamps can also end before the last utterance of a word has completed. When the script then uses these timestamps to cut the larger audio file into individual phrases, a lot of words end up truncated, and the last little sounds land at the start of the next file. Ultimately, this means the TTS finetune thinks these little guttural bits and blats are part of the speech pattern.
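One mitigation when cutting clips from Whisper timestamps is to pad the end of each segment by a small tail before slicing, so word endings are not chopped off and bled into the next file. A minimal sketch (the sample rate, tail length, and function are illustrative assumptions, not code from the webui):

```python
SAMPLE_RATE = 22050   # assumed; use your dataset's actual rate
TAIL_SEC = 0.25       # assumed buffer; tune by listening to the cut clips

def cut_segment(audio, start_sec, end_sec, total_sec):
    """Slice one phrase out of a long recording, extending Whisper's end
    timestamp by TAIL_SEC (clamped to the file length) to avoid truncation."""
    end_sec = min(end_sec + TAIL_SEC, total_sec)
    start = int(start_sec * SAMPLE_RATE)
    end = int(end_sec * SAMPLE_RATE)
    return audio[start:end]

# 3 s of silence stands in for real audio samples here.
audio = [0.0] * (3 * SAMPLE_RATE)
clip = cut_segment(audio, 1.0, 2.0, 3.0)  # actually cuts 1.0 s .. 2.25 s
```

Padding the start by a smaller amount can help too, since the previous clip's tail is exactly where the stray sounds come from; overlapping clips slightly is usually harmless for TTS training data.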

In the end, the best way to get really good results is to hand-build, or at least judiciously audit and edit, your dataset so that none of these clips make it into the training CSVs. It's better to have fewer high-quality audio files than to train on lots of bad data. Even for single-word utterances, the results are far better now.
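That auditing step can be partially automated with simple heuristics before listening by hand. A sketch that flags suspicious rows of an LJSpeech-style `metadata.csv` (the thresholds and the per-clip duration lookup are my assumptions, not part of the webui):

```python
def audit_rows(rows, durations, min_sec=0.5, max_chars_per_sec=30):
    """Split metadata rows into (keep, reject) piles.

    rows:      list of [filename, transcript] pairs from metadata.csv
    durations: dict filename -> clip length in seconds (e.g. from soundfile)
    Rejects clips that are implausibly short, or where the transcript is far
    too long for the audio (a hint that Whisper's text or timestamps drifted).
    """
    keep, reject = [], []
    for row in rows:
        name, text = row[0], row[1]
        dur = durations.get(name, 0.0)
        too_short = dur < min_sec
        too_dense = dur > 0 and len(text) / dur > max_chars_per_sec
        (reject if too_short or too_dense else keep).append(row)
    return keep, reject

keep, reject = audit_rows(
    [["a.wav", "hello there"], ["b.wav", "hi"]],
    {"a.wav": 1.0, "b.wav": 0.2},
)
```

Anything this flags still deserves a manual listen; the point is only to shrink the set of clips you have to audit by ear.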