auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

Differences in Talker Embedding Extraction #90

Open rppravin opened 3 years ago

rppravin commented 3 years ago

[Image: TalkerEmbQn]

In the AutoVC paper, it seems,

Even though both of these representations are estimated for the same talker, they are estimated from different input speech. S2 is potentially based on a longer speech duration, so the two talker embeddings could differ.

However, in the codebase, S2 is reused in place of Es(X1). Any idea how much this affects the disentanglement of the content and talker representations? Since Es(X1) could be based on a shorter speech duration, would it be useful to estimate it separately, so that the network learns to disentangle only the talker information that is appropriate for a given input speech segment?
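
To make the question concrete, here is a minimal toy sketch of the mismatch I have in mind. Everything in it is an assumption for illustration only: `toy_speaker_embedding` is a hypothetical stand-in for Es (a random projection with mean pooling, not the real speaker encoder), the "speech" is synthetic noise around a shared talker mean, and S2 is assumed to be an average over several longer utterances of the same talker. It only shows that a segment-level embedding and a talker-level embedding need not coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MELS, EMB_DIM = 80, 16

# Hypothetical stand-in for the speaker encoder Es: a fixed random projection.
W = rng.standard_normal((N_MELS, EMB_DIM))


def toy_speaker_embedding(mel):
    """Project each frame, mean-pool over time, then L2-normalise."""
    pooled = np.tanh(mel @ W).mean(axis=0)
    return pooled / np.linalg.norm(pooled)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic "talker": a shared mean spectrum plus frame-level variation.
talker_mean = rng.standard_normal(N_MELS)


def fake_utterance(n_frames):
    return talker_mean + 2.0 * rng.standard_normal((n_frames, N_MELS))


x1 = fake_utterance(64)                            # short input segment X1
others = [fake_utterance(400) for _ in range(10)]  # longer utterances, same talker

emb_segment = toy_speaker_embedding(x1)            # analogue of Es(X1)
emb_talker = np.mean([toy_speaker_embedding(u) for u in others], axis=0)
emb_talker /= np.linalg.norm(emb_talker)           # analogue of S2 (assumed averaged)

similarity = cosine(emb_segment, emb_talker)
print(f"cosine(Es(X1), S2) = {similarity:.3f}")
```

In this toy setup the similarity is below 1, i.e. the segment-level and talker-level embeddings differ. Whether a gap of this kind matters for the real encoder and for disentanglement is exactly what I am asking about.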

Thanks, Pravin