Open iDavide opened 2 weeks ago
UPDATE: that wasn't a phoneme problem. The training wasn't starting because I wasn't using the Colab GPU. So I guess the problem is all about the configuration? I have now set epochs=2000
and I'll keep you updated if I come back with some results.
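For anyone hitting the same thing, a quick sanity check before calling trainer.fit() is to confirm that the runtime actually exposes a GPU. This is only a minimal sketch: `pick_device` is a helper name made up for illustration, and it assumes PyTorch is the backend (Coqui TTS is built on it).

```python
def pick_device():
    """Report which device PyTorch would train on.

    Returns 'cuda' if a GPU is visible, 'cpu' if not,
    and 'torch-missing' if PyTorch is not installed at all.
    """
    try:
        import torch  # imported lazily so the check itself never crashes
    except ImportError:
        return "torch-missing"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training device: {device}")
```

If this prints `cpu` on Colab, the notebook is probably running on a CPU runtime and the hardware accelerator needs to be switched to GPU in the runtime settings.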
I kept training the model and it has now done 33k steps. The avg_loss_1 curve on TensorBoard is converging to 25, but I'm noticing it's improving very slowly... Is there a way to use Colab's GPU efficiently? Am I doing the training right based on the configuration? How many steps should I do in order to get a good VITS model?
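To relate the step count to the epochs setting, the arithmetic can be sketched as below. It does not say how many steps are enough; it only converts between the two numbers. The 33k steps come from the run above, while `num_clips` and `batch_size` are placeholder values, not the real ones from this dataset, and the helper assumes the last partial batch of each epoch is kept.

```python
import math


def steps_to_epochs(global_steps, num_clips, batch_size):
    """Convert optimizer steps into (fractional) passes over the dataset."""
    # One epoch is one pass over all clips; a partial final batch counts as a step.
    steps_per_epoch = math.ceil(num_clips / batch_size)
    return global_steps / steps_per_epoch


# Placeholder dataset size and batch size, for illustration only.
epochs_done = steps_to_epochs(33_000, num_clips=6_000, batch_size=32)
print(f"Roughly {epochs_done:.1f} epochs completed")
```

With the real clip count and batch size plugged in, this shows how far a run is from a given epochs value such as 2000.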
Describe the bug
Hi everyone. I'm new to the world of ML, so I'm not used to training AI models... I really want to create my own TTS model using Coqui's VITS trainer, so I've done a lot of research about it. I configured some dataset parameters and configuration functions and then started training. For the training I used almost 10 hours of audio spoken in Italian. After training I tried the model, and the result isn't just bad, it's REALLY bad... The model doesn't even "speak" a language. Here is an example sentence:
"input_text": "Oh, finalmente sei arrivato fin qui. Non è affatto comune che un semplice essere umano riesca a penetrare così profondamente nella mia dimora. Scarlet Devil Mansion non è un posto per i deboli di cuore, lo sapevi?"
(I do not recommend listening to the audio at full volume.)
https://github.com/user-attachments/assets/b4039119-2666-455f-8ed7-6a0b05179f8f
The voice in the audio is actually from an RVC model. I imported the model into a program that runs TTS first and then applies the weights of an RVC model to the generated audio. It's not an RVC problem, because I have used this program with the same RVC model and other TTS models (mostly English and one Italian) and they work well, especially the English ones.
To Reproduce
Here's my configuration:
Dataset config:
Dataset format:
Audio:
Characters:
General config:
Expected behavior
No response
Logs
No response
Environment
Additional context
Additionally, after a few days I tried to use espeak phonemes, but the trainer.fit() function gets stuck at the beginning with this output: