CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

About encoder training #192

Closed · geekboood closed this issue 4 years ago

geekboood commented 4 years ago

Hi, I'm trying to train the encoder on a Mandarin dataset with about 1700 speakers and 900 hours of data. I've already run 220k steps, but the results have some problems. I use the training code from this repo and the test code from the Resemblyzer repo. [attached image: cross-similarity matrices] As you can see, the similarity between speakers seems fine, but the similarity between utterances has a problem. Can you provide any suggestions? Should I continue training the encoder to 300k steps or 1M steps? Or should I use a larger dataset? This is the loss curve on the training set: [attached screenshot: training loss curve]
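For reference, a minimal sketch of this kind of evaluation, assuming Resemblyzer is installed and a hypothetical test-set layout of one directory per speaker (the paths and grouping below are placeholders, not the actual setup described above):

```python
from pathlib import Path

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # loads Resemblyzer's bundled pretrained encoder

# Hypothetical layout: one directory per speaker, several wavs each
speaker_dirs = sorted(p for p in Path("mandarin_test_set").iterdir() if p.is_dir())
embeds = [np.stack([encoder.embed_utterance(preprocess_wav(w))
                    for w in sorted(d.glob("*.wav"))])
          for d in speaker_dirs]

# Embeddings are L2-normalized, so inner products are cosine similarities.
# Same-speaker similarities should sit well above cross-speaker similarities.
same = np.concatenate([np.inner(e, e)[np.triu_indices(len(e), k=1)]
                       for e in embeds])
cross = np.concatenate([np.inner(embeds[i], embeds[j]).ravel()
                        for i in range(len(embeds))
                        for j in range(i + 1, len(embeds))])
print(f"median same-speaker similarity:  {np.median(same):.3f}")
print(f"median cross-speaker similarity: {np.median(cross):.3f}")
```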

CorentinJ commented 4 years ago

Your scores are decent, don't worry. If you can include more data, the results would definitely improve.

geekboood commented 4 years ago

@CorentinJ But the similarity between utterances is not good, which means this encoder cannot match two utterances from the same speaker. Won't that be a problem?

vezzick commented 4 years ago

A bit of a tangent, but what setup are you training on, and with what hyperparameters? That could be affecting your training; I'm having a bit of trouble getting the right setup myself.

CorentinJ commented 4 years ago

> @CorentinJ But the similarity between utterances is not good, which means this encoder cannot match two utterances from the same speaker. Won't that be a problem?

They're good enough, look at the medians. It's not a major issue that the two distributions overlap a bit.

> A bit of a tangent, but what setup are you training on, and with what hyperparameters? That could be affecting your training; I'm having a bit of trouble getting the right setup myself.

You might find answers here.

geekboood commented 4 years ago

@vezzick Actually I use a hidden size of 768 and 80 mel bins to train the model. My number of speakers per batch is 64 and my number of utterances per speaker is 10, which is a bit small. During training, I/O seems to be a bottleneck and my GPU is not fully utilized. Also, I train the model on an RTX card and use FP16 to accelerate training.
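A hedged sketch of what that FP16 setup could look like with torch.cuda.amp, assuming this repo's SpeakerEncoder (whose loss(embeds) returns the GE2E loss and EER); the loader here is a hypothetical placeholder, and plain gradient clipping stands in for the repo's do_gradient_ops. For the I/O bottleneck, raising the DataLoader's num_workers and enabling pin_memory usually helps.

```python
import torch

from encoder.model import SpeakerEncoder  # this repo's encoder

speakers_per_batch, utterances_per_speaker = 64, 10
device = torch.device("cuda")
model = SpeakerEncoder(device, loss_device=device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for mels in loader:  # hypothetical loader yielding (64 * 10, n_frames, 80) batches
    mels = mels.to(device, non_blocking=True)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # FP16 forward pass
        embeds = model(mels)               # (640, 256), L2-normalized
        embeds = embeds.view(speakers_per_batch, utterances_per_speaker, -1)
        loss, eer = model.loss(embeds)     # GE2E loss and equal error rate
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)             # so clipping sees the true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 3.0)
    scaler.step(optimizer)
    scaler.update()
```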

geekboood commented 4 years ago

Another problem is the ReLU activation function. It forces the embedding vector to be element-wise non-negative, which from my point of view is uncommon in face recognition (whose loss functions are similar). Although ReLU can mitigate overfitting, the embedding is then confined to a limited subspace of the original vector space. I know this setting is from the paper, but maybe we could give other activation functions a try, such as Leaky ReLU.
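To make the proposal concrete, here is a minimal sketch of the encoder's forward pass with Leaky ReLU swapped in for ReLU. It mirrors this repo's architecture (3-layer LSTM, linear projection, L2 normalization) but is a hypothetical modification, not the repo's code verbatim:

```python
import torch
from torch import nn

class SpeakerEncoder(nn.Module):
    def __init__(self, mel_n_channels=80, hidden_size=768, embedding_size=256):
        super().__init__()
        self.lstm = nn.LSTM(mel_n_channels, hidden_size,
                            num_layers=3, batch_first=True)
        self.linear = nn.Linear(hidden_size, embedding_size)
        self.act = nn.LeakyReLU(0.2)  # was nn.ReLU() in the original

    def forward(self, mels):  # mels: (batch, n_frames, 80)
        _, (hidden, _) = self.lstm(mels)
        # Project the top layer's final hidden state, then L2-normalize.
        # With Leaky ReLU, embedding coordinates can go negative, so the
        # embedding is no longer confined to the non-negative orthant.
        raw = self.act(self.linear(hidden[-1]))
        return raw / (raw.norm(dim=1, keepdim=True) + 1e-5)
```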

Liujingxiu23 commented 4 years ago

@geekboood, did you train the Chinese encoder model from scratch? Have you started training the tacotron and wavernn models? How are the results? I fine-tuned the pretrained model on Chinese speech with lr=0.00001, but my result (the "cross-similarity between utterances" image) is not as good as yours.

By the way, have you changed the activation function? The paper says: "stack of 3 LSTM layers of 768 cells, each followed by a projection to 256 dimensions. The final embedding is created by L2-normalizing the output of the top layer at the final frame". I did not find any mention of an activation function. Did I miss anything?

@CorentinJ I looked at https://github.com/HarryVolek/PyTorch_Speaker_Verification/blob/master/speech_embedder_net.py and https://github.com/Janghyun1230/Speaker_Verification/blob/master/model.py, and neither uses an activation function after the projection, right?
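For comparison, a hedged sketch of the paper's wording with no activation after the projection, using PyTorch's proj_size argument (available since PyTorch 1.8) to get the per-layer 768 -> 256 projection; this illustrates what the two linked implementations do, not their exact code:

```python
import torch
from torch import nn

class GE2EEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=768, proj=256):
        super().__init__()
        # proj_size adds a 768 -> 256 projection after each LSTM layer
        # (LSTMP cells), matching "each followed by a projection to 256"
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3,
                            proj_size=proj, batch_first=True)

    def forward(self, mels):  # mels: (batch, n_frames, 80)
        out, _ = self.lstm(mels)
        embed = out[:, -1]    # top layer, final frame; no activation applied
        return embed / (embed.norm(dim=1, keepdim=True) + 1e-5)  # L2-normalize
```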

Also, why did you use torch.optim.Adam instead of SGD?

ghost commented 4 years ago

I am closing this issue due to inactivity. Please feel free to reopen when you are ready to continue the discussion.