Open bensonbs opened 3 weeks ago
Describe the bug
When generating higher-pitched female voices after fine-tuning the xtts-v2 model, there is a noticeable hoarseness, resembling the strain one might experience when trying to reach high musical notes.
abnormal example: https://mork.ro/NQjFi
normal example: https://mork.ro/3iZ8Q#
Two voices generated from the same model, using different audio prompts.
To Reproduce
infer
Expected behavior
No response
Logs
No response
Environment
{
    "CUDA": {
        "GPU": ["NVIDIA GeForce RTX 4090"],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.1+cu121",
        "TTS": "0.22.0",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": ["64bit", "ELF"],
        "processor": "x86_64",
        "python": "3.10.13",
        "version": "#202310061235~1697396945~22.04~9283e32 SMP PREEMPT_DYNAMIC Sun O"
    }
}
Additional context
No response
I'm seeing the same thing after fine-tuning on 900 hours of Chinese data; the checkpoint at 40,000 steps is prone to it. What is your data? Which languages? How many steps did you train for?