iDavide opened this issue 1 month ago
UPDATE: that wasn't a phoneme problem. The training wasn't loading because I wasn't using the Colab GPU. So I guess the problem is all in the configuration? I've now set epochs=2000
and I'll keep you updated if I come back with some results.
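Since the earlier run silently fell back to CPU, it can help to verify the runtime actually has a GPU before starting. A minimal, stdlib-only sketch (it only checks for the `nvidia-smi` tool that Colab GPU runtimes expose; the helper name is mine):

```python
import shutil

def gpu_runtime_available() -> bool:
    """Return True if the NVIDIA driver tool is on PATH, i.e. a GPU runtime is active."""
    return shutil.which("nvidia-smi") is not None

if not gpu_runtime_available():
    print("No GPU detected: in Colab, use Runtime > Change runtime type > GPU.")
```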
I kept training the model and it has now done 33k steps. On TensorBoard, avg_loss_1 is converging to 25, but I'm noticing it improves very slowly... Is there a way to use Colab's GPU efficiently? Am I doing the training right given my configuration? How many steps do I need to get a good VITS model?
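For orientation on how steps relate to epochs: steps per epoch is roughly ceil(num_samples / batch_size), so a step target translates directly into an epoch count. A small sketch with made-up numbers (~8 hours of speech in ~6-second clips is roughly 4 800 samples; the batch size is an assumption):

```python
import math

num_samples = 4800   # rough guess: ~8 h of speech split into ~6 s clips
batch_size = 32      # assumed training batch size

# One optimizer step per batch, so:
steps_per_epoch = math.ceil(num_samples / batch_size)
epochs_for_33k_steps = math.ceil(33_000 / steps_per_epoch)

print(steps_per_epoch, epochs_for_33k_steps)  # 150 220
```

With these assumed numbers, the 33k steps already done correspond to about 220 epochs, which puts the epochs=2000 setting in perspective.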
I stopped training the model because I thought the dataset might not be sufficient (~8 hours of speech). I'll try a bigger dataset such as MLS... Additionally, I came to a conclusion: since the TTS model generates mel spectrograms (not sure about that), I'd need to train an Italian vocoder model, but I don't really know how to do that for a specific language. Any comment is appreciated.
Vits directly outputs audio because it trains its own vocoder internally, you don't need to train a separate one.
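Since VITS is end-to-end, the whole setup reduces to a single training config. A minimal sketch of what an Italian VITS config can look like with Coqui TTS; the field values, paths, and formatter are illustrative (not a tuned recipe), and the exact import paths and argument names should be checked against the installed TTS version:

```python
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig

# Assumes the dataset metadata is in LJSpeech layout; path is hypothetical.
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="metadata.csv",
    path="data/my_italian_dataset",
)

config = VitsConfig(
    run_name="vits_italian",
    batch_size=32,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="it",           # espeak-ng Italian voice
    phoneme_cache_path="phoneme_cache",
    datasets=[dataset_config],
    output_path="output",
)
```

Because the model synthesizes waveforms directly, there is no separate vocoder config anywhere in this setup.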
You saved me a lot of time, thank you. By the way, I'm now building the transcripts for the MLS dataset, deleting all the files with a male voice (I actually want to train a female voice model). The thing is: when I train on Colab, the first epoch goes slowly while the next ones go really fast, and with a dataset containing many audio files each epoch takes more steps. That said, should I train on my own GPU (GTX 1650) or on the Colab GPU (free plan)?
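The gender filtering can be done once over the MLS metadata instead of deleting files by hand. A stdlib-only sketch, assuming a pipe-separated speaker metadata file whose first two columns are speaker id and gender, and MLS-style transcript lines of the form "speaker_book_utt<TAB>text" — the column layout and separators are assumptions, so check them against your MLS copy:

```python
def female_speaker_ids(metainfo_lines):
    """Collect speaker ids whose gender column reads 'F' (assumed layout: id | gender | ...)."""
    ids = set()
    for line in metainfo_lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) >= 2 and parts[1].upper() == "F":
            ids.add(parts[0])
    return ids

def filter_transcripts(transcript_lines, keep_ids):
    """Keep only transcript lines whose utterance id belongs to a wanted speaker."""
    kept = []
    for line in transcript_lines:
        utt_id = line.split("\t", 1)[0]       # e.g. "10032_10194_000001"
        speaker = utt_id.split("_", 1)[0]     # leading component is the speaker id
        if speaker in keep_ids:
            kept.append(line)
    return kept

# Tiny inline example with invented ids and text:
meta = ["1001 | F | train", "1002 | M | train"]
lines = ["1001_20_000001\tciao", "1002_30_000001\tbuongiorno"]
print(filter_transcripts(lines, female_speaker_ids(meta)))
```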
Describe the bug
Hi everyone. I'm new to the world of ML, so I'm not used to training AI models... I really want to create my own TTS model using Coqui's VITS trainer, so I've done a lot of research about it. I configured the dataset parameters and the configuration, then started training on almost 10 hours of Italian speech. After training I tried the model, but the result is not just bad, it's FAIRLY bad... The model doesn't even "speak" a language. Here is an example sentence:
"input_text": ""input_text": "Oh, finalmente sei arrivato fin qui. Non è affatto comune che un semplice essere umano riesca a penetrare così profondamente nella mia dimora. Scarlet Devil Mansion non è un posto per i deboli di cuore, lo sapevi?""
(I do not recommend listening to the audio at full volume.)
https://github.com/user-attachments/assets/b4039119-2666-455f-8ed7-6a0b05179f8f
The voice in the audio actually comes from an RVC model: I imported the model into a program that first runs TTS and then applies the RVC model's weights to the generated audio. It's not an RVC problem, because I've used this program with the same RVC model and other TTS models (mostly English and one Italian) and they work well, especially the English ones.
To Reproduce
Here's my configuration:
Dataset config:
Dataset format:
Audio:
Characters:
General config:
Expected behavior
No response
Logs
No response
Environment
Additional context
Additionally, after a few days I tried to use espeak phonemes, but the trainer.fit() function gets stuck at the beginning with this output:
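When trainer.fit() appears to hang right after enabling phonemes, one common first check is whether an espeak backend is actually installed, since the espeak phonemizer shells out to the binary (the apparent hang can also just be the phoneme cache being built on the first run). A minimal stdlib probe (the helper name is mine):

```python
import shutil

def find_espeak():
    """Return the path of an espeak binary on PATH, preferring espeak-ng, else None."""
    for name in ("espeak-ng", "espeak"):
        path = shutil.which(name)
        if path:
            return path
    return None

backend = find_espeak()
print(backend or "no espeak binary found: install espeak-ng before enabling phonemes")
```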