Naozumi520 / Bert-VITS2-Cantonese-Yue

vits2 backbone with multilingual-bert, modified to support Cantonese
GNU Affero General Public License v3.0

Audio length #2

Status: Closed (kexul closed this issue 6 months ago)

kexul commented 7 months ago

Hi, thanks for your effort! In your experiments, how many hours of audio does it take to train a decent model? In this issue, you said you trained with 9~11 hours of data; is that the model hosted on the Hugging Face page? I asked a friend who is a native Cantonese speaker, and he said the output quality is not so good. Is there room for improvement if more data is used for training, or is there a flaw in the text frontend or base model that limits the quality? Many thanks!

Naozumi520 commented 6 months ago

Hi @kexul, I'm happy that you're also interested in Cantonese TTS! Yes, I used 10 hours of podcast data to train the model, which is bert-vits2-yue-base. As for the output quality, I have to admit that it's definitely not so good, as your friend said. And yes, there is room to improve the model with more training data. To improve the speech, you should finetune the model on a small amount of data (more is better, but even a very small amount, like 150 samples, can already change the speaking style). That should improve the emotion and the fluency, since the podcast training data is mostly emotionless and uniform.

kexul commented 6 months ago

> To improve the speech, you should finetune the model on a small amount of data (more is better, but even a very small amount, like 150 samples, can already change the speaking style). That should improve the emotion and the fluency

Thank you so much! It's good to know that as few as 150 samples are sufficient. I'm about to finetune it with 林雪's voice, taken from some of his movie clips, which is very emotional. 😆 Hopefully I'll get a good result! 🤩

Naozumi520 commented 6 months ago

Good luck! 😆

kexul commented 6 months ago

I'd like to thank you again! @Naozumi520
I've finetuned on your base model and gotten really decent results so far, with only 8 minutes of audio (about 120 samples). I'll gather more samples and see if it gets better. Thanks for your work and for open-sourcing it!
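
For anyone who wants to check how much audio a finetuning set actually contains (for example, whether it is close to the 8 minutes and roughly 120 clips mentioned above), here is a minimal sketch using only Python's standard library. It assumes the clips are uncompressed WAV files in a single folder; the function names and the folder path are hypothetical and are not part of this repository.

```python
import wave
from pathlib import Path


def clip_seconds(path):
    """Return the duration of one WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def summarize(folder):
    """Count the WAV clips in `folder` and total their duration.

    `folder` is a hypothetical directory of finetuning clips.
    """
    durations = [clip_seconds(p) for p in sorted(Path(folder).glob("*.wav"))]
    total = sum(durations)
    return {
        "clips": len(durations),
        "total_minutes": total / 60,
        # Guard against an empty folder to avoid dividing by zero.
        "mean_seconds": total / len(durations) if durations else 0.0,
    }


if __name__ == "__main__":
    # "finetune_clips" is a placeholder path; point it at your own data.
    print(summarize("finetune_clips"))
```

Only files matching `*.wav` directly inside the folder are counted; other formats or nested folders would need a different reader or glob pattern.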