auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

How do you generate the speaker embedding? #34

Open hsiehjackson opened 4 years ago

hsiehjackson commented 4 years ago

I am wondering how you extracted the speaker embedding with the pre-trained speaker verification model.

The speaker embedding I get from https://github.com/resemble-ai/Resemblyzer is a vector with all non-negative values and mostly zeros, due to the ReLU at the end of the model. However, the speaker embeddings in your metadata.pkl have both positive and negative values and look roughly normally distributed.

Could you give me some advice on how you extracted the embeddings in your work? I tried skipping the ReLU layer and the L2-normalization of the vector, but the result is still not similar to yours. Hope to receive your response! Thanks!
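
For context, this is roughly how I get the embedding from Resemblyzer (a minimal sketch; the wav path is just a placeholder):

```python
from pathlib import Path

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

# Load and preprocess one utterance (the path is a placeholder).
wav = preprocess_wav(Path("p225_001.wav"))

# Pretrained GE2E-style encoder shipped with Resemblyzer.
encoder = VoiceEncoder()
embed = encoder.embed_utterance(wav)  # numpy array of shape (256,)

# Because of the final ReLU + L2-normalization, the embedding is
# non-negative and unit-norm, unlike the embeddings in metadata.pkl.
print(embed.min() >= 0, np.isclose(np.linalg.norm(embed), 1.0))
```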

By the way, your paper says that the speaker encoder is a stack of 2 LSTM layers with cell size 768, but the model in https://github.com/resemble-ai/Resemblyzer uses 3 LSTM layers with cell size 256. I am confused about whether you used the same speaker encoder model.
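
For reference, my reading of the paper's description (2 LSTM layers with cell size 768, last output projected to a 256-dim embedding) would look roughly like the sketch below; this is only my interpretation, not the authors' code:

```python
import torch
import torch.nn as nn

class SpeakerEncoderSketch(nn.Module):
    """Rough sketch of the speaker encoder as described in the paper:
    2 LSTM layers with cell size 768, the output at the last time step
    projected down to 256 dimensions (L2-normalized as in GE2E)."""

    def __init__(self, n_mels=80, hidden=768, embed_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, mels):
        # mels: (batch, time, n_mels)
        outputs, _ = self.lstm(mels)
        embed = self.proj(outputs[:, -1, :])            # last time step only
        return embed / embed.norm(dim=1, keepdim=True)  # unit-norm embedding
```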

auspicious3000 commented 4 years ago

You can use one-hot embeddings if you are not doing zero-shot conversion. I implemented my own speaker encoder, which has not been released. Resemblyzer is just a similar implementation I found online. You don't have to use the same embeddings as we did.
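
A minimal illustration of the one-hot alternative (the number of speakers below is just an example); the speaker-embedding dimension fed to the model then becomes the number of training speakers rather than 256:

```python
import numpy as np

def one_hot_speaker_embedding(speaker_index, num_speakers):
    """One-hot speaker embedding for non-zero-shot conversion:
    each training speaker gets its own fixed basis vector."""
    emb = np.zeros(num_speakers, dtype=np.float32)
    emb[speaker_index] = 1.0
    return emb

# e.g. 4 training speakers, embedding for speaker index 2
print(one_hot_speaker_embedding(2, 4))  # [0. 0. 1. 0.]
```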

DatanIMU commented 4 years ago

I have the same question, but I cannot find an answer.

Other papers describe 3 LSTM layers with 256-dimensional cells (input: 40-channel mel spectrograms), with each utterance broken into 800 ms windows overlapped by 50%.

This paper describes 2 LSTM layers followed by a 256-dimensional output (input: 80-dimensional mel spectrograms).

The paper says this is done "during inference". I am wondering about the inference process. Could you show me a way to understand it?
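
My understanding (based on the d-vector/GE2E setup that Resemblyzer follows, not on the released AutoVC code): during inference the utterance is split into 800 ms windows with 50% overlap, each window is embedded separately, and the normalized window embeddings are averaged and re-normalized into a single utterance embedding. A rough sketch, where `encoder_fn` is a placeholder for any window-to-embedding model:

```python
import numpy as np

def utterance_embedding(mel, encoder_fn, win_frames=80, overlap=0.5):
    """Sketch of d-vector style inference: slide 800 ms windows
    (win_frames frames at a 10 ms hop) over the mel spectrogram with
    50% overlap, embed each window, then average and re-normalize.
    `encoder_fn` is a placeholder for a window -> embedding model."""
    hop = max(int(win_frames * (1 - overlap)), 1)
    partials = []
    for start in range(0, max(len(mel) - win_frames, 0) + 1, hop):
        window = mel[start:start + win_frames]
        emb = encoder_fn(window)
        partials.append(emb / np.linalg.norm(emb))  # normalize each partial
    mean = np.mean(partials, axis=0)
    return mean / np.linalg.norm(mean)              # normalize the average
```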

bva1986 commented 3 years ago

Hello

How many steps did you train the speaker encoder for, and what optimizer did you use?