[Bug] Demo Inference Produces Distorted Audio Output

Describe the bug

I followed the demo code provided by Coqui to create a simple dataset and fine-tune a model using the Gradio interface. However, when I load the fine-tuned model and run inference, the output audio is heavily distorted and sounds like the buzz of an electric shaver.

You can listen to the output at the following link: Distorted Audio Output.

Steps to Reproduce:

Create Dataset:

Followed the instructions to create a simple dataset using the demo code.

Fine-Tune Model:

Used the Gradio interface as provided in the demo to fine-tune the model.

Load Model and Inference:

Loaded the fine-tuned model and ran inference. Dataset creation, fine-tuning, and inference were all done through the Gradio interface, launched as follows:

py TTS/TTS/demos/xtts_ft_demo/xtts_demo.py

Expected Result:

The model should produce clear, intelligible speech corresponding to the input text.

Actual Result:

The output audio is distorted and unintelligible. You can hear the output here: Distorted Audio Output.
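To put numbers on the distortion, here is a minimal sketch that reads a saved output file and reports its sample rate, peak, RMS level, and the fraction of samples at full scale. It assumes the Gradio output was saved as a 16-bit PCM WAV; `wav_stats` and the file path passed on the command line are hypothetical names for illustration and are not part of TTS.

```python
import array
import math
import sys
import wave


def wav_stats(path):
    """Return basic statistics for a 16-bit PCM WAV file (all channels pooled)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        raise ValueError("file contains no samples")
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Samples at or beyond +/-32767 are treated as clipped.
    clipped = sum(1 for s in samples if abs(s) >= 32767) / len(samples)
    return {
        "sample_rate": rate,
        "peak": peak,
        "rms": rms,
        "clipped_fraction": clipped,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python wav_stats.py path/to/output.wav
    print(wav_stats(sys.argv[1]))
```

A large clipped fraction or an unexpected sample rate would help narrow down whether the problem lies in the generated waveform itself or in how it is written and played back.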

Additional Information:

I verified that CUDA and the NVIDIA drivers are correctly installed and operational. The nvidia-smi command confirms that the GPU is recognized and utilized by the system. Other models and libraries utilizing CUDA work as expected.

Logs and Error Messages:

No explicit error messages were encountered during the execution. The process completes without any exceptions.

Request:

Could you please provide guidance on how to resolve this issue, or let me know whether any specific configuration is required to avoid this distortion in the output?

Thank you for your assistance.

To Reproduce

py TTS/TTS/demos/xtts_ft_demo/xtts_demo.py

Expected behavior

No response

Logs

No response

Environment

- Operating System: Windows 11
- Python Version: 3.10.4
- CUDA Version: 11.5
- PyTorch Version: 1.11.0+cu115
- coqui-ai TTS Version: latest commit on GitHub
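
For reference, a minimal sketch of how the values above could be gathered from inside the same Python environment that runs the demo. `collect_env` is a hypothetical helper name, not part of TTS; the PyTorch fields are filled in only if `torch` is importable.

```python
import importlib.util
import platform


def collect_env():
    """Return the environment fields listed in this report as a dict."""
    info = {
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        "torch": None,
        "cuda_build": None,
        "cuda_available": None,
    }
    # Import torch only if it is installed, so the script also runs without it.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        info["cuda_build"] = torch.version.cuda  # None on CPU-only builds
        info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Running this in the interpreter that launches `xtts_demo.py` confirms that the PyTorch build and CUDA visibility reported here are the ones the demo actually uses.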

Additional context

No response
