CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Training on RTX 3090. Batch Sizes and other parameters? #914

Closed Dannypeja closed 2 years ago

Dannypeja commented 2 years ago

Hi, sorry that I have to ask these questions here. If there were a Discord or something similar, I would ask there instead.

I have access to an RTX 3090 and want to use its VRAM to increase the batch sizes. I have learned that higher batch sizes mean faster and better progress, at least up to a point. Is that correct?

If yes:

Where do I find the Parameters for the Batch sizes of the three models?

Are these locations correct?

Encoder:

https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/b7a66742361a3f9a06cdef089cdebf9f6cd82b11/encoder/params_model.py#L9-L11 Maybe I can increase it to 64, like it was in the original GE2E setup.

Synthesizer:

Is it the last value in the brackets, the "12"? https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/7432046efc23cabf176f9fdc8d2fd67020059478/synthesizer/hparams.py#L53

Vocoder:

https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/b7a66742361a3f9a06cdef089cdebf9f6cd82b11/vocoder/hparams.py#L33-L37

What are reasonable batch sizes for an RTX 3090?

If no:

That would also be good to know. Are there any other advantages I could get out of this monster GPU?

How long should decent training take?

Again, I have no frame of reference.

Thanks a lot in advance to anyone who can help me get some understanding of this new and extremely interesting topic!

Bebaam commented 2 years ago

Hey, the locations for the batch_size changes look fine. The huge amount of VRAM should allow relatively fast training, so I would just try different batch sizes and monitor the VRAM usage if possible. Depending on your overall intent, you could also use a larger model for training, at least for the encoder (I wouldn't suggest that unless you have much more than 10,000 different speakers).

More important is: what do you want to achieve? Do you want to train a model from scratch (#126 is quite old, but IMO it gives good insights into training), maybe in a new language? If so, which datasets do you want to use? Or do you want to further fine-tune the existing models, maybe even for a single speaker (#437)? The training procedure is described here: https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Training
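Just as a quick orientation, this is roughly where the three batch-size knobs live (a sketch from memory of the default files, so please double-check the names and values against your checkout):

```python
# encoder/params_model.py -- the effective encoder batch is
# speakers_per_batch * utterances_per_speaker utterances
speakers_per_batch = 64
utterances_per_speaker = 10

# synthesizer/hparams.py -- the batch size is the last value of each tts_schedule tuple
tts_schedule = [(2, 1e-3, 20_000, 12), ...]   # (r, lr, step, batch_size)

# vocoder/hparams.py
voc_batch_size = 100
```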

Dannypeja commented 2 years ago

Thanks for the quick answer! I should have been clearer about my intentions: I want to train new models for the German language to better understand how this all works. I want to use the mailabs dataset plus Mozilla Common Voice; together that makes up about 15,000 voices of very varied quality, and I am not sure whether the quality will be good enough for the synthesizer and vocoder. What did you mean by "use a larger model for the encoder"? Increase the model dimensions? Where would I do so? Also, do I need to adjust anything else if I increase the batch size? And is there any rough estimate of how long training could take?

Bebaam commented 2 years ago

Okay. For the encoder you could change the embedding size or the hidden size from 256 to e.g. 768 in encoder/params_model.py, as discussed in #126. IMHO, and after the question in #840, I would suggest keeping both the embedding size and the hidden size at 256; only if you do not get good results would I start increasing the hidden size first, since it does not influence the synthesizer's model size and allows faster training. I don't think you need to adjust anything else if you increase the batch size.

The Common Voice dataset is by far the biggest for German models AFAIK and should be sufficient for a first training run. If you want to extend it, you could also have a look at the LibriVox dataset.

If I had to bet on training times: for the encoder, about 3 days on the RTX 3090; it also depends a bit on the hard disk and CPU. It is not certain that it ever converges, but I would train it until the loss is below 0.01, or better 0.005. With the batch size possible on an RTX 3090, I would say 50k-100k steps could be enough. After that, synthesizer training will take only about a day. If you use 16 kHz as the audio sample rate, you don't need to train a vocoder; the pretrained one in this repo will be sufficient. But I can't guarantee any of this, I'm only making guesses :D
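To be concrete about which values I mean (names as they appear in encoder/params_model.py, if I remember them correctly; the 768 is only the example from #126):

```python
# encoder/params_model.py
model_hidden_size = 256       # LSTM hidden size ("hidden embedding size"); cheaper to raise,
                              # since it does not change the synthesizer's input
model_embedding_size = 256    # final d-vector size; raising this (e.g. to 768) also grows
                              # the synthesizer, so I would leave it at 256 for now
model_num_layers = 3
```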

Bebaam commented 2 years ago

When training the synthesizer after the encoder, it is important that your model learns attention. For this purpose, you can look into synthesizer/saved_models/plots. The attention plots should form some kind of line from the top left to the top right (they may be cut off at some point). Until attention is learned, the plots are empty or show random noise. I think you will see it after 5-10k steps. I would train the synthesizer until it converges, and change the synthesizer/hparams.py values before training. Since you won't be using the default batch size of 12, I would also decrease the steps. Maybe something like:

```python
tts_schedule = [(2,  1e-3,  5_000, 50),   # Progressive training schedule
                (2,  5e-4, 10_000, 50),   # (r, lr, step, batch_size)
                (2,  2e-4, 20_000, 50),   #
                (2,  1e-4, 30_000, 50),   # r = reduction factor (# of mel frames
                (2,  3e-5, 50_000, 50),   #     synthesized for each decoder iteration)
                (2,  1e-5, 70_000, 50)],  # lr = learning rate
```

But I would test it and adjust it properly :)
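One thing that confused me at first, so just to spell it out: the third value is the cumulative target step, not a per-session length, and training resumes in whatever session the current step falls into. A simplified sketch of how I understand synthesizer/train.py consumes the schedule (not a copy of the actual code):

```python
def run_schedule(tts_schedule, start_step=0):
    """Walk the schedule the way the training loop does (as far as I understand it)."""
    step = start_step
    for r, lr, max_step, batch_size in tts_schedule:
        if step >= max_step:
            continue  # this session is already finished, e.g. when resuming a checkpoint
        print(f"train {max_step - step} steps with r={r}, lr={lr}, batch_size={batch_size}")
        step = max_step

run_schedule([(2, 1e-3,  5_000, 50),
              (2, 5e-4, 10_000, 50),
              (2, 2e-4, 20_000, 50),
              (2, 1e-4, 30_000, 50),
              (2, 3e-5, 50_000, 50),
              (2, 1e-5, 70_000, 50)])
```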

Dannypeja commented 2 years ago

Hey, thank you so much, that was really detailed and also helps me get a better feeling for all those values! I will leave the embedding size at 256 for now.

You talked about attention for the encoder. Do I need to change any parameters for that? I assume not, since you only said to increase the batch size. How should I do that? Just increase batch_size until out-of-memory errors occur?

Is attention also a thing for the synthesizer? About the tts_schedule: you made some adjustments there besides the batch size. Could you elaborate on how you guessed r, lr, and step? Just from experience? I'm asking to understand your gut feeling :) And what is the r-factor in this context?

About the vocoder: do you mean that vocoding a mel spectrogram is language-independent? So the only reason to retrain it would be to get a higher sample rate as output?

Common Voice seems large, but there is a lot of junk in it, and I hope it will be clean enough. I have no feeling for how clean it has to be for the synthesizer. Once I get the data sorted... jeez... I will try everything out and report back! Is there a way to contact you (Discord) where we can continue the conversation? I still don't know if this is the right place, since I thought it was meant for technical issues. :)

Thanks a lot!

Bebaam commented 2 years ago

You are welcome. I needed a lot of time to figure out which things I had to change, so I am happy if it helps now :) Sorry if that came across wrongly, but attention is only important for the synthesizer. For the encoder, the loss itself should be enough; you also get a plot of some speakers every x steps.

About the tts_schedule: a higher r-value would speed up training but can have some disadvantages (I can't remember which :D). The easiest option is to stay at a value of 2; if the model does not learn attention at all, you could try a higher value. Yes, the values of the schedule are just from experiments. The steps at which you change the learning rate depend on the batch size, since the model makes faster progress per step with a higher batch size.
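If you want a starting point rather than pure gut feeling, one rough rule I use (my own assumption, not something from the repo) is to shrink the default step boundaries by the same factor you grow the batch size:

```python
# Default schedule as I remember it (batch size 12); please verify against synthesizer/hparams.py.
default_schedule = [(2, 1e-3,  20_000, 12),
                    (2, 5e-4,  40_000, 12),
                    (2, 2e-4,  80_000, 12),
                    (2, 1e-4, 160_000, 12),
                    (2, 3e-5, 320_000, 12),
                    (2, 1e-5, 640_000, 12)]

def rescale_schedule(schedule, new_batch_size):
    """Scale the step boundaries inversely with the batch-size increase."""
    old_batch_size = schedule[0][3]
    factor = old_batch_size / new_batch_size
    return [(r, lr, int(step * factor), new_batch_size) for r, lr, step, _ in schedule]

print(rescale_schedule(default_schedule, 50))
# -> step boundaries of roughly 4.8k / 9.6k / 19k / 38k / 77k / 154k,
#    in the same ballpark as the hand-picked schedule above
```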

If you used e.g. a 48 kHz sample rate for the synthesizer, the sound generated by the 16 kHz pretrained vocoder here would play three times as fast as it should. As long as you stay at 16 kHz, the vocoder should be fine, and yes, it is language-independent. You can try it after you have your synthesizer and fine-tune it if needed. Or you can train one from scratch; with your GPU it should be fast and can be done with the same datasets you use for training the synthesizer. Just follow the steps for training in the wiki I mentioned earlier :)
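Just to illustrate the sample-rate point: if your source audio is, say, 48 kHz, resampling it to 16 kHz keeps the pretrained vocoder usable (this may already happen in the repo's preprocessing; the librosa/soundfile calls and file names below are only an illustration, not the repo's actual pipeline):

```python
import librosa
import soundfile as sf

# Hypothetical 48 kHz input file; any resampler would do the same job.
wav, sr = librosa.load("speaker_0001/utt_0001.wav", sr=None)     # keep the original rate
wav_16k = librosa.resample(wav, orig_sr=sr, target_sr=16000)     # down to the vocoder's rate
sf.write("speaker_0001/utt_0001_16k.wav", wav_16k, 16000)
```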

Bebaam commented 2 years ago

Unfortunately there is no other way than here, and I am also just a beginner with this stuff. I will try to check the repo from time to time, so feel free to ask more questions. Maybe it will help other people too :)

Dannypeja commented 2 years ago

Sorry to ask again:

Real-Time-Voice-Cloning/encoder/params_model.py

```python
learning_rate_init = 1e-4
speakers_per_batch = 64
utterances_per_speaker = 10
```

Is speakers_per_batch the batch size? Or is it utterances_per_speaker, or kind of both? What values would you imagine for an RTX 3090?

Bebaam commented 2 years ago

Yes, speakers_per_batch is the batch size. utterances_per_speaker can remain as it is. I think I would start with a batch size of 150, but I would track it with "watch nvidia-smi" or "nvidia-smi -l 5", for example. I would maximise GPU VRAM usage, so increase the batch size to the limit if possible. Training the encoder will take the most time in my experience.
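Besides nvidia-smi, you can also check the headroom from inside the training process with plain PyTorch calls. Just a sketch; the 150 is only the starting value suggested above, and remember the encoder batch is really speakers_per_batch * utterances_per_speaker utterances (e.g. 150 * 10 = 1500):

```python
import torch

# Run this after a few training steps with speakers_per_batch = 150.
total = torch.cuda.get_device_properties(0).total_memory
peak = torch.cuda.max_memory_allocated(0)
print(f"peak VRAM: {peak / 1e9:.1f} GB of {total / 1e9:.1f} GB")
# Plenty of headroom left? Then raise speakers_per_batch further and repeat.
```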

Dannypeja commented 2 years ago

Do these values look like they should? This is encoder training with a batch size of 150. I am continuing the German training of padmalcom with only the mailabs dataset (6 speakers and 900 hours). The loss does not seem to be improving much anymore, and I suspect it could be due to the small number of speakers.

Maybe once I manage to add Common Voice with its 15k speakers it will continue improving?

[Screenshot: encoder training output, 2021-12-13]

Bebaam commented 2 years ago

The loss is way too high; after this many steps it should be at 0.0x, at least with this batch size and the Common Voice dataset. Do you have the proper folder structure, as we discussed in #934? Could you show a picture of the embeddings? It should be saved in encoder/saved_models/your_model. Furthermore, you could try another trick, which may increase training speed a bit: does training speed improve if you change "pin_memory" to True in encoder/data_objects/speaker_verification_dataset?
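For context, this is what the pin_memory flag does at the DataLoader level (a self-contained toy example, not the repo's SpeakerVerificationDataset; in the repo you would flip the flag where its data loader is constructed):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 160, 40))      # stand-in for batches of mel frames
loader = DataLoader(dataset, batch_size=64, num_workers=4,
                    pin_memory=True)                      # page-locked host buffers

for (batch,) in loader:
    batch = batch.cuda(non_blocking=True)                 # pinned memory allows async copies
    break                                                 # just demonstrating the transfer
```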

Dannypeja commented 2 years ago

This is the embedding plot. But as mentioned, it is not Common Voice yet, just mailabs.

[Screenshot: embedding plot, 2021-12-13]

necron1976 commented 2 years ago

I am using Google Translate, sorry for the spelling. I am training a model from scratch; my programming knowledge is basic ("I learn from Google"). I was reading along, and I am attaching an image of how my training is going; you will see that my loss is lower than yours at 12,280 steps, so I don't know what the problem might be. I use Windows, Anaconda, and an RTX 3090. I made the mistake of using a slow hard disk; does it improve a lot if I use the SSD? I also take this opportunity to ask: does my graph look correct, and how long do I have to train it? Your time mean and std are better than mine; do I have to change something?

Dannypeja commented 2 years ago

How large is your dataset? :)

necron1976 commented 2 years ago

How large is your dataset? :)

"I'm using" approximately 3900 speakers "890k audios", separated by folders.

I have another dataset with thousands of speakers, but they are all mixed up in the same folder. I'm not using it, because I don't know if it can be used while everything is in the same folder.

Edit: I tried moving the data to the SSD and the difference is huge. It takes a few hours to transfer the data from one disk to the other, but it's flying now. I didn't think I was going to see that much of a difference.