In the AutoVC paper, it seems that:
S2 is estimated from more than 20 seconds of speech for a given talker.
Es(X1) is estimated from the speech segment that is input to the content encoder.
Both representations are estimated for the same talker, but from different input speech, and S2 is potentially based on a longer duration. So there could be some differences between the two talker embeddings.
However, in the codebase, S2 is reused in place of Es(X1). Any idea how much this affects the degree of disentanglement between the content and talker representations? Since Es(X1) could be based on a shorter speech duration, would it be useful to estimate it separately, so that the network learns to disentangle only the talker information that is actually present in a given input speech segment?
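To make the question concrete, here is a rough sketch of how the gap between the two embeddings could be measured. Everything in it is an assumption for illustration: `toy_speaker_embedding` is a hypothetical stand-in (mean pooling plus L2 normalization) rather than the speaker encoder used in the codebase, the frames are synthetic random features rather than real mel-spectrograms, and the lengths of 2000 and 128 frames are arbitrary choices for "long" and "short" speech.

```python
import math
import random


def toy_speaker_embedding(frames):
    """Hypothetical stand-in for a speaker encoder: mean-pool, then L2-normalize."""
    dim = len(frames[0])
    mean = [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0
    return [v / norm for v in mean]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
dim = 8  # toy feature dimension (assumption)

# Synthetic "talker": a fixed mean vector plus per-frame noise.
talker_mean = [rng.gauss(0.0, 1.0) for _ in range(dim)]
long_speech = [[m + rng.gauss(0.0, 1.0) for m in talker_mean] for _ in range(2000)]
short_segment = long_speech[:128]  # the segment fed to the content encoder

s2 = toy_speaker_embedding(long_speech)       # plays the role of S2
es_x1 = toy_speaker_embedding(short_segment)  # plays the role of Es(X1)

similarity = cosine_similarity(s2, es_x1)
print(f"cosine similarity between S2 and Es(X1): {similarity:.4f}")
```

Running the same comparison with the real speaker encoder and real utterances would show how far Es(X1) drifts from S2 as the segment gets shorter, which is the quantity I am asking about.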
Thanks, Pravin