Hi, to answer your questions: (1) 2000 was determined through multiple trials; it gave the best performance. (2) I'm not sure whether increasing the batch size gives better results, but first of all, since the number of training speakers is so large, you need to keep the data balanced, i.e., a similar number of utterances per speaker. You also need to make sure the audio is relatively clean, without too much noise or reverberation, otherwise the speaker encoder may learn that harmful information. In my experience, you can use a pre-trained speaker encoder (e.g., one trained for a speaker recognition task) instead of training the speaker encoder from scratch; this stabilizes the training process and also improves conversion performance when the number of training speakers is large.
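A minimal sketch of what "keeping the data balanced" could look like in PyTorch. The `speaker_ids` list and the use of `WeightedRandomSampler` are assumptions for illustration, not part of this repo:

```python
import torch
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, speaker_ids, batch_size):
    """Sample utterances so every speaker is drawn roughly equally often.

    `speaker_ids[i]` is assumed to be the speaker label of `dataset[i]`
    (taken from your metadata); heavily represented speakers get lower weights.
    """
    counts = Counter(speaker_ids)
    weights = torch.tensor([1.0 / counts[s] for s in speaker_ids], dtype=torch.double)
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)
```

And a sketch of swapping in a pre-trained speaker encoder instead of learning one from scratch. Here I use resemblyzer's GE2E model (trained for speaker verification) purely as an example; it is not part of this repo, and any frozen speaker-verification embedding (d-vector, x-vector, ECAPA) could play the same role:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

spk_encoder = VoiceEncoder()  # pre-trained weights, kept frozen during VC training

def speaker_embedding(wav_path: str) -> np.ndarray:
    wav = preprocess_wav(wav_path)            # resample to 16 kHz, trim silence, normalize
    return spk_encoder.embed_utterance(wav)   # 256-dim L2-normalized speaker embedding
```

The conversion model then conditions on these fixed embeddings, so the speaker representation stays stable no matter how many training speakers you add.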
Hello, I tried to run training on a large dataset with more than 1,000,000 speakers, but the losses were very bad. Can you please tell me how to choose the training parameters (batch_size, n_prediction_steps, n_negatives, etc.) to achieve good results? The results on small data (VCTK + LibriTTS) were good. Your code contains the formula `warmup_epochs = 2000 // (len(dataset) // cfg.training.batch_size)`. Why 2000? What is the optimal number of warmup epochs? I tried to increase the batch size to speed up training, but in my experiments this made the results worse. I will be glad for any advice!
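For what it's worth, the formula reads as "spread roughly 2000 warm-up steps over whole epochs", so its value depends heavily on dataset size. A quick worked example with purely illustrative numbers (the batch size and dataset sizes below are assumptions, not values from the repo):

```python
batch_size = 256                                  # hypothetical value

# Very large dataset, e.g. ~1.5M utterances:
steps_per_epoch = 1_500_000 // batch_size         # 5859 steps per epoch
warmup_epochs = 2000 // steps_per_epoch           # 2000 // 5859 == 0 -> integer division gives no warm-up

# Smaller set in the VCTK + LibriTTS range, e.g. ~40k utterances:
small_steps_per_epoch = 40_000 // batch_size      # 156 steps per epoch
small_warmup_epochs = 2000 // small_steps_per_epoch  # 2000 // 156 == 12 warm-up epochs
```

On a very large dataset the integer division can collapse to zero warm-up epochs, which may be worth checking in your setup.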