coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

[Bug] Fine-tuned XTTS v1.1: recurrence of prompts in the output audio. #3122

Closed Selectorrr closed 1 year ago

Selectorrr commented 1 year ago

Describe the bug

If we follow the instructions in this recipe, the trained model starts repeating the conditioning prompt in the output audio.

As noted on @erogol's blog, this problem was fixed in XTTS v1.1 as follows:

we have introduced a new conditioning method that is based on masking. We take a segment of the speech as the prompt but mask that segment while computing the loss. This way, the model is not able to cheat by copying the prompt to the output.
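The quoted fix can be illustrated with a minimal sketch. This is not Coqui's actual implementation; the function name and shapes are hypothetical, and it only shows the core idea: compute a per-token loss, then zero out the positions that came from the speech prompt so the model gains nothing by copying them.

```python
# Minimal sketch (NOT the Coqui TTS implementation) of prompt-masked loss:
# the leading `prompt_len` positions are excluded from the loss.
import torch
import torch.nn.functional as F

def prompt_masked_loss(logits, targets, prompt_len):
    """logits: (T, vocab) predictions, targets: (T,) token ids,
    prompt_len: number of leading positions taken from the speech prompt."""
    per_token = F.cross_entropy(logits, targets, reduction="none")  # (T,)
    mask = torch.ones_like(per_token)
    mask[:prompt_len] = 0.0  # mask the prompt segment out of the loss
    return (per_token * mask).sum() / mask.sum()

# Tiny usage example with random data (T tokens, V vocab, P prompt tokens).
T, V, P = 10, 32, 3
loss = prompt_masked_loss(torch.randn(T, V), torch.randint(0, V, (T,)), P)
```

Because the first `P` positions carry zero weight, changing the model's predictions on the prompt segment leaves the loss unchanged, which removes the incentive to echo the prompt.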

It remains unclear how to apply the same conditioning method when training a model with the recipe.

To Reproduce

Train a model using the recipe.

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 4090"
        ],
        "available": true,
        "version": "11.8"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.0.1",
        "TTS": "0.19.0",
        "numpy": "1.24.3"
    },
    "System": {
        "OS": "Windows",
        "architecture": [
            "64bit",
            "WindowsPE"
        ],
        "processor": "Intel64 Family 6 Model 191 Stepping 2, GenuineIntel",
        "python": "3.11.5",
        "version": "10.0.22621"
    }
}
Edresson commented 1 year ago

Hi @Selectorrr, the masking conditioning is used by default in this recipe.

I think the repetition issues are related to overfitting. If you train for too long on a small dataset, the model overfits easily. You should stop training after 1–3 epochs. I recommend trying an earlier checkpoint.

Selectorrr commented 1 year ago

@Edresson You may be right. Although I tried it with a small number of epochs, my dataset is quite large. I'll try a dataset of about 2 hours with fewer epochs and report the results later.

Selectorrr commented 1 year ago

@Edresson Yes, indeed: with about 2 hours of audio and 1 epoch of training there are no problems and everything works well. I was originally planning to train the model on multiple speakers and a larger dataset, so I'll wait for updates. Thanks a lot, you guys are awesome.