auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License
987 stars, 205 forks

How to ensure that the output of the encoder is independent of the speaker? #3

Closed hyzhan closed 5 years ago

hyzhan commented 5 years ago

How to ensure that the output of the encoder is independent of the speaker? I can't see the concept of confusing networks or generative adversarial training in this paper. I don't understand how it works.

auspicious3000 commented 5 years ago

As the title "AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss" indicates, the main idea of the paper is to get rid of adversarial loss, instead of using it.

nkcdy commented 5 years ago

> How to ensure that the output of the encoder is independent of the speaker? I can't see the concept of confusing networks or generative adversarial training in this paper. I don't understand how it works.

My guess is that in the training stage, the network uses only the reconstruction loss. That means that if the output of the network matches the input perfectly, the network is FORCED to make the output of the encoder independent of the speaker, because the speaker embedding has already been fed in through another path. Something like feedback theory.
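A minimal sketch of that idea (not the actual AutoVC architecture; the layer sizes and names here are hypothetical): the encoder is squeezed through a narrow bottleneck, the speaker embedding is concatenated back in at the decoder, and the only loss is reconstruction. Since the speaker information is already available on the side path, the cheapest way for the network to reconstruct the input is to spend the scarce bottleneck capacity on content alone.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only.
N_MELS, T, BOTTLENECK, SPK_DIM = 80, 128, 4, 256

# Narrow bottleneck: too small to carry both content and speaker identity.
content_encoder = nn.Linear(N_MELS, BOTTLENECK)
# Decoder receives the speaker embedding through a separate path.
decoder = nn.Linear(BOTTLENECK + SPK_DIM, N_MELS)

mel = torch.randn(1, T, N_MELS)          # input spectrogram frames
spk_emb = torch.randn(1, 1, SPK_DIM)     # speaker embedding
spk_emb = spk_emb.expand(-1, T, -1)      # broadcast over time

code = content_encoder(mel)              # content code (the bottleneck)
recon = decoder(torch.cat([code, spk_emb], dim=-1))

# Only an autoencoder (reconstruction) loss -- no adversarial term.
loss = nn.functional.mse_loss(recon, mel)
loss.backward()
```

In the paper this pressure comes from choosing the bottleneck dimension carefully: wide enough to keep content, narrow enough that encoding speaker identity as well would hurt reconstruction.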