CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Training 2-3 models, suggestions? #1157

Open prakharpbuf opened 1 year ago

prakharpbuf commented 1 year ago

Hi, great work with Real-Time-Voice-Cloning!

I already have some experience training the models: I fine-tuned the model on one of the speakers from LibriSpeech dev-clean and got a noticeable improvement in output quality. Now I'm going to train two or three models:

1. Everything from scratch using the LibriTTS dataset. I know blue-fish (now @ghost) and @mbdash tried to train a model using LibriTTS in #449, but the output did not improve. They were still trying and moved the discussion to a Slack channel, so I don't know what the end result was. If anyone knows what happened after they trained the new encoder and everything, sharing the results (and even better, the models) would be much appreciated! After this model is trained, I might also fine-tune it for my voice (same as point 2 below).
2. Fine-tune the pretrained model on 1 hour of my own voice. blue-fish noted in #437 that fine-tuning on your own voice with 0.2 hr of data for a few thousand steps improves the output quality for your voice. But I wonder what happens if I use a whole hour (maybe more) instead of 0.2 hr and train for more than just a few thousand steps, maybe on the order of tens of thousands. (A sketch of what that fine-tuning loop might look like follows this list.)
3. Maybe also take the pretrained model and train it for an additional 100-200k steps using more data from Mozilla Common Voice or something else. Do you think this would be useful?
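For point 2, here's a minimal sketch of what I understand the fine-tuning loop to amount to, just to make the knobs concrete. This is my rough mental model, not the repo's actual code: `ToySynth` is a runnable stand-in for the repo's Tacotron (which you'd actually restore from the pretrained checkpoint via `synthesizer_train.py`), the batches are random stand-ins for preprocessed SV2TTS data, and the real loss also includes stop-token terms. The main ideas are a low learning rate and a bounded step count:

```python
import torch
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-in for the repo's Tacotron synthesizer. In real use, build the
# model from synthesizer/models and load the pretrained checkpoint instead.
class ToySynth(torch.nn.Module):
    def __init__(self, mel_dim=80, embed_dim=256):  # 256-dim speaker embeds assumed
        super().__init__()
        self.proj = torch.nn.Linear(mel_dim + embed_dim, mel_dim)

    def forward(self, texts, mels, embeds):
        # Condition each mel frame on the speaker embedding (toy behaviour only).
        e = embeds.unsqueeze(1).expand(-1, mels.size(1), -1)
        return self.proj(torch.cat([mels, e], dim=-1))

model = ToySynth().to(device).train()

# Much lower learning rate than from-scratch training, so the model adapts
# to the new speaker without forgetting what it learned on LibriSpeech.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Random stand-in batches; in practice these come from your preprocessed data.
dataloader = [(None, torch.randn(4, 200, 80), torch.randn(4, 256)) for _ in range(10)]

for step, (texts, mels, embeds) in enumerate(dataloader, start=1):
    mels, embeds = mels.to(device), embeds.to(device)
    mels_pred = model(texts, mels, embeds)
    loss = F.l1_loss(mels_pred, mels)  # the real loss also has stop-token terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # In a real run you'd stop on the order of 10k steps (point 2 above)
    # and watch held-out quality to catch overfitting on 1 hour of data.
```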

In #126, @sberryman trained all three models from scratch, but he was not happy with the synthesizer and vocoder he trained. I'm not sure exactly what that means because I don't have much experience with AI, but he said the synthesizer did not align well. He did say the encoder turned out pretty good, though, and he uploaded his models; the link still works, so maybe I can try something starting from his better encoder?
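From what I've read, "aligning" refers to the synthesizer's attention matrix between input characters and output mel frames: a healthy model shows a clean diagonal (each output frame attends to successive characters), while a broken one looks smeared or jumps around. Here's a small snippet for inspecting that; the matrix below is synthetic just so it runs, and in practice you'd dump the attention weights from the Tacotron forward pass during training:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in attention matrix with a clean diagonal, shape
# (decoder_steps, encoder_steps). Replace with real dumped attention weights.
dec_steps, enc_steps = 400, 120
alignment = np.exp(-0.5 * ((np.arange(dec_steps)[:, None] / dec_steps
                            - np.arange(enc_steps)[None, :] / enc_steps) / 0.02) ** 2)

plt.imshow(alignment.T, aspect="auto", origin="lower", interpolation="none")
plt.xlabel("Decoder timestep (mel frames)")
plt.ylabel("Encoder timestep (characters)")
plt.title("Synthesizer attention alignment (synthetic example)")
plt.colorbar()
plt.tight_layout()
plt.savefig("alignment.png")
```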

I don't have very fast storage (I'll be using external hard drives), and I have NVIDIA Quadro P2000 GPUs. To train each model, I'll use separate PCs with the same specs so they all train in parallel.
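Since I'm a bit worried the external drives might starve the GPU during training, I put together a quick check that times how fast the preprocessed mel files read back. The directory path is just my guess at the SV2TTS preprocessing layout; adjust it to wherever your output actually lives:

```python
import time
from pathlib import Path
import numpy as np

# Assumed preprocessing output layout; point this at your own mels directory.
mel_dir = Path("datasets_root/SV2TTS/synthesizer/mels")
files = list(mel_dir.glob("*.npy"))[:500]
if not files:
    raise SystemExit(f"No .npy files found under {mel_dir}")

# Time raw read-back throughput from the external drive.
start = time.perf_counter()
total_bytes = sum(np.load(f).nbytes for f in files)
elapsed = time.perf_counter() - start
print(f"Read {len(files)} mels, {total_bytes / 1e6:.1f} MB in {elapsed:.2f}s "
      f"({total_bytes / 1e6 / elapsed:.1f} MB/s)")
```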

Any suggestions on playing around with hparams, different ideas for training, or anything else? All suggestions are welcome and appreciated.

Also, if you have any ideas on training, let me know: what to train (fine-tune the pretrained model or train from scratch), what dataset to use, what hparams, and how long to train, and I'll do it. I have plenty of time.

Thanks!

oops408 commented 1 year ago

Try treating the model architecture (e.g. location-sensitive vs. content-based attention) and the loss function as hparams and see if those help the fine-tuning. I'm trying out SGD optimization to see if that improves the results. Oh yeah, maybe pitch shifting would be interesting as well... something like the sketch below, as a data augmentation step before computing mels (the file path and sample rate are placeholders; match the repo's hparams):
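```python
import random
import librosa
import soundfile as sf

# Placeholder file and sample rate; match the sample rate in the repo's hparams.
wav, sr = librosa.load("utterance.wav", sr=16000)

# Shift by a random number of semitones so the model sees more pitch variety
# per speaker. Keep shifts small; large ones distort the timbre.
n_steps = random.uniform(-2.0, 2.0)
shifted = librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_steps)

sf.write("utterance_shifted.wav", shifted, sr)
```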