[Bug] Hoarseness in Higher-Pitched Female Voices with xtts-v2 after finetune

Describe the bug

After fine-tuning the xtts-v2 model, generated higher-pitched female voices are noticeably hoarse, resembling the strain of someone trying to reach high musical notes.

abnormal example: https://mork.ro/NQjFi

normal example: https://mork.ro/3iZ8Q#

Both samples were generated by the same fine-tuned model, using different audio prompts.

To Reproduce

Run inference with the fine-tuned model using each audio prompt.
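
For anyone trying to reproduce this, here is a minimal sketch of what such an inference run might look like with TTS 0.22.0. It is not the reporter's actual script. The paths, file names, text, and language in `main()` are placeholders, and the helper names `load_finetuned_xtts` and `synthesize_with_prompts` are made up for illustration. It also assumes the fine-tuned checkpoint directory uses the default file names that `load_checkpoint` looks for.

```python
def load_finetuned_xtts(config_path, checkpoint_dir):
    """Load an XTTS checkpoint. Both paths are placeholders for your own run."""
    # Imported lazily so the helper below can be exercised without Coqui TTS installed.
    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts

    config = XttsConfig()
    config.load_json(config_path)
    model = Xtts.init_from_config(config)
    # Assumption: checkpoint_dir holds the model and vocab files under their default names.
    model.load_checkpoint(config, checkpoint_dir=checkpoint_dir, use_deepspeed=False)
    model.cuda()
    return model


def synthesize_with_prompts(model, text, language, prompts):
    """Synthesize the same text once per reference clip.

    prompts maps a label to a reference wav path; returns {label: waveform}.
    """
    outputs = {}
    for label, wav_path in prompts.items():
        # Each audio prompt yields its own conditioning latents and speaker embedding.
        gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
            audio_path=[wav_path]
        )
        result = model.inference(text, language, gpt_cond_latent, speaker_embedding)
        outputs[label] = result["wav"]
    return outputs


def main():
    import torch
    import torchaudio

    # Hypothetical paths: replace with your fine-tuned run and reference clips.
    model = load_finetuned_xtts("finetuned_run/config.json", "finetuned_run/")
    prompts = {"abnormal": "high_pitch_prompt.wav", "normal": "other_prompt.wav"}
    wavs = synthesize_with_prompts(model, "This is a test sentence.", "en", prompts)
    for label, wav in wavs.items():
        # XTTS generates audio at 24 kHz.
        torchaudio.save(f"{label}.wav", torch.tensor(wav).unsqueeze(0), 24000)
```

Calling `main()` after editing the paths writes one file per prompt. Since the text and model stay the same, any difference between the outputs comes from the audio prompt.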

Expected behavior

No response

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 4090"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.1+cu121",
        "TTS": "0.22.0",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.13",
        "version": "#202310061235~1697396945~22.04~9283e32 SMP PREEMPT_DYNAMIC Sun O"
    }
}

Additional context

No response

I'm seeing the same thing after fine-tuning on 900 hours of Chinese data; the checkpoint at 40,000 steps is prone to it. What data did you use? Which languages, and how many steps?

coqui-ai / TTS

[Bug] Hoarseness in Higher-Pitched Female Voices with xtts-v2 after finetune #3774
