Wendison / VQMIVC

Official implementation of VQMIVC: One-shot (any-to-any) Voice Conversion @ Interspeech 2021 + Online playing demo!

Question About Batch Size, Number of Epochs and Learning Rate #13

Closed jlmarrugom closed 2 years ago

jlmarrugom commented 2 years ago

Hi @Wendison , I've already trained some models (with VCTK subsets and external speakers) and noticed that a bigger batch size doesn't necessarily result in better audio quality for the same 500 epochs; in some cases, audio quality could even be worse (for male references). My question is:

Do you have any reports or experiments with different batch sizes, numbers of epochs (why 500 and not 600 or more?), and learning rates for different batch sizes?

If not, what advice could you give regarding the batch size and the number of epochs? Is bigger always better?

For complex data like this there should be an improvement with bigger batches, but the learning rate or number of epochs would need to be tuned.

Thank You.

Wendison commented 2 years ago

Hi, during my experiments I tried different settings for the batch size, number of training epochs, and learning rate.

For the batch size, I remember the initial value was 64, but later I set it to 256 to speed up training and found there was no harm to performance, so I just fixed it at 256.

For the number of training epochs, I chose the best epoch based on two criteria: (1) for the validation data, the reconstruction loss no longer decreases and the CPC prediction accuracy no longer increases; (2) listening to the intermediate converted results on validation speakers (over several conversion pairs), I chose the epoch with the best conversion quality.
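
As a rough sketch, the first criterion amounts to something like this (the metric names and numbers are purely illustrative, not values from the repo):

```python
# Toy checkpoint-selection sketch: stop at the epoch where the validation
# reconstruction loss no longer decreases and the CPC prediction accuracy
# no longer increases. `history` stands in for per-epoch validation metrics.
history = [
    (100, 0.52, 0.61),  # (epoch, val_recon_loss, val_cpc_accuracy)
    (200, 0.47, 0.66),
    (300, 0.46, 0.67),
    (400, 0.46, 0.67),  # both metrics have plateaued
]

best_epoch, best_loss, best_acc = history[0]
for epoch, loss, acc in history[1:]:
    if loss < best_loss or acc > best_acc:  # still improving
        best_epoch = epoch
        best_loss, best_acc = min(best_loss, loss), max(best_acc, acc)

print(f"Candidate epoch for listening tests: {best_epoch}")
```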

The learning rate was set empirically: I found that the validation reconstruction loss began to increase after epoch 300, so I decreased the learning rate at that point to avoid overfitting.
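
In PyTorch terms, that kind of decay could be expressed as below (the milestone and decay factor here are my illustration, not the exact settings from the repo's training script):

```python
import torch

# Illustrative schedule: halve the learning rate once the validation loss
# starts to rise (around epoch 300 above). Milestone and gamma are
# assumptions, not the repo's exact settings.
model = torch.nn.Linear(80, 80)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[300], gamma=0.5
)

for epoch in range(500):
    # ... forward/backward passes over the training set would go here ...
    optimizer.step()
    scheduler.step()  # LR drops from 1e-3 to 5e-4 after epoch 300
```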

> For complex data like this there should be an improvement with bigger batches

As for this point, I'm not sure whether it's true, as I didn't notice much difference between smaller and larger batch sizes.

According to my experience, listening to the intermediate converted results is very important for determining whether training is successful, so I suggest you do so. Besides, you may tune the value of mi_weight in https://github.com/Wendison/VQMIVC/blob/72c650c2d8c6190d25455063f57cffb0be07938f/config/train.yaml#L8, as it influences the disentanglement performance: a larger value leads to fewer dependencies between the different speech representations, but may also degrade the quality of the converted voice.
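
Conceptually, mi_weight scales the mutual-information penalty in the total training objective, roughly like this (the loss names are placeholders, not the repo's actual variables; see the training code for the real terms):

```python
# Sketch of how mi_weight trades off disentanglement against quality.
# recon_loss, cpc_loss and mi_loss are placeholders for the actual loss
# terms computed in the training loop.
def total_loss(recon_loss, cpc_loss, mi_loss, mi_weight):
    # Larger mi_weight -> stronger penalty on dependencies between the
    # content / speaker / pitch representations (better disentanglement),
    # but potentially worse converted-voice quality if set too high.
    return recon_loss + cpc_loss + mi_weight * mi_loss
```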

Hope this helps :)

jlmarrugom commented 2 years ago

Thank you, I've created a notebook to check audio conversions of the saved checkpoints.
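
Roughly, the notebook loop looks like this (`convert` is a placeholder for my own inference wrapper, and the paths are hypothetical, not from this repo):

```python
from pathlib import Path

import soundfile as sf

# Sketch: run the same conversion pair through every saved checkpoint
# and dump the audio for listening tests. `convert` is a placeholder
# for my own inference wrapper around the VQMIVC models.
CKPT_DIR = Path("checkpoints")                       # hypothetical location
SRC_WAV, REF_WAV = "p225_001.wav", "p226_002.wav"    # example VCTK pair

out_dir = Path("conversions")
out_dir.mkdir(exist_ok=True)
for ckpt in sorted(CKPT_DIR.glob("*.pt")):
    wav = convert(ckpt, source=SRC_WAV, reference=REF_WAV)  # placeholder
    sf.write(str(out_dir / f"{ckpt.stem}.wav"), wav, 16000)
```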

From my experiments, I think this model will perform better on large datasets like LibriTTS, since it generalizes well. It would also be important to add male speakers, since VCTK is gender-imbalanced towards female speakers; some conversions tend to mimic a female voice.
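
For reference, the imbalance is easy to confirm from the corpus metadata (assuming the standard speaker-info.txt layout of the VCTK release):

```python
from collections import Counter

# Quick check of VCTK's gender balance from its metadata file
# (path and column layout assumed from the standard corpus release,
# where the third column of speaker-info.txt is GENDER).
counts = Counter()
with open("VCTK-Corpus/speaker-info.txt") as f:
    next(f)  # skip the header row
    for line in f:
        fields = line.split()
        if len(fields) >= 3:
            counts[fields[2]] += 1

print(counts)  # expect noticeably more F than M speakers
```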

Wendison commented 2 years ago

That's a great finding! Besides, I also conducted some experiments on LibriTTS and found that the content encoder can extract more accurate/robust content representations for out-of-domain speakers, so the source content is well preserved. VCTK has a relatively limited vocabulary, while LibriTTS has a more diverse vocabulary, which improves the generalization ability of the content encoder.

jlmarrugom commented 2 years ago

Excellent! Do you know how many utterances are needed for each speaker? I tested the model with 60 utt/spkr and 120 utt/spkr, and obtained decent results with 60 and comparable results with 120. Maybe to speed up training on LibriTTS, one could select 100 utterances per speaker and still obtain a good result. What do you think?

Wendison commented 2 years ago

I didn't test how many utterances are required per speaker; I just used all the data from train-clean-360 & train-clean-100 of LibriTTS for training and test-clean for testing, so those were just preliminary experiments. But your idea sounds reasonable: data balance across speakers should be considered, and it may lead to better results.
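
If anyone wants to try that, a simple balanced-subset sketch might look like this (the speaker/chapter directory layout follows the LibriTTS release; the 100-utterance cap is just the suggestion above, otherwise arbitrary):

```python
import random
from pathlib import Path

# Sketch: cap each LibriTTS speaker at n_per_speaker utterances to build a
# balanced training subset. LibriTTS stores audio as speaker/chapter/*.wav.
def balanced_subset(root="LibriTTS/train-clean-100", n_per_speaker=100, seed=0):
    rng = random.Random(seed)
    subset = []
    for speaker_dir in sorted(Path(root).iterdir()):
        if speaker_dir.is_dir():
            wavs = sorted(speaker_dir.rglob("*.wav"))
            rng.shuffle(wavs)  # random sample, reproducible via seed
            subset.extend(wavs[:n_per_speaker])
    return subset
```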

jlmarrugom commented 2 years ago

Thank You!