Wendison / VQMIVC

Official implementation of VQMIVC: One-shot (any-to-any) Voice Conversion @ Interspeech 2021 + Online playing demo!
MIT License

Question about different embeddings/representations #43

Closed arnabdas8901 closed 1 year ago

arnabdas8901 commented 1 year ago

Dear Team, thank you very much for your paper and your code. I have a query about the representations. I understand that MI minimization makes the representations distinct from each other so that they embed separate information. However, it is not clear to me how this ensures that the speaker encoder encodes only speaker features and the pitch encoder encodes only pitch features. Is there any loss other than the ones mentioned in the paper? Sorry if I am missing something. I would be grateful if you could shed a bit of light on this.

Thanks Arnab

Wendison commented 1 year ago

Hi, it is hard to ensure that the speaker encoder encodes ONLY speaker information, since its output is a global vector that may capture all kinds of time-invariant global attributes, such as speaker, channel, environment, etc. In the paper, however, we assume that the speaker encoder encodes speaker information, and the results in Table 3 (s-accuracy) verify that the representation extracted by the speaker encoder does contain speaker information. As for a pitch encoder, none is used; we simply feed the original pitch values as inputs to the VC system.
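To make "feed the original pitch values as inputs" concrete: VC systems that condition on F0 typically normalize log-F0 per utterance so the contour carries intonation rather than the source speaker's absolute pitch level. The sketch below is a minimal, hypothetical helper (the function name `normalize_f0` and the zero-marks-unvoiced convention are assumptions for illustration, not code from the VQMIVC repo):

```python
import numpy as np

def normalize_f0(f0):
    """Per-utterance z-normalization of log-F0 over voiced frames.

    Hypothetical sketch: removes the speaker-dependent pitch level/range
    so the conditioning signal mostly carries the intonation contour.
    Unvoiced frames are assumed to be marked with 0 and are left at 0.
    """
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0                      # boolean mask of voiced frames
    out = np.zeros_like(f0)
    if voiced.sum() < 2:                 # not enough frames to estimate stats
        return out
    logf0 = np.log(f0[voiced])
    mu, sigma = logf0.mean(), logf0.std()
    out[voiced] = (logf0 - mu) / max(sigma, 1e-8)
    return out

# Toy contour: unvoiced (0) and voiced frames in Hz
f0_track = [0.0, 220.0, 230.0, 0.0, 210.0]
norm_track = normalize_f0(f0_track)
```

After normalization the voiced frames have zero mean and unit variance in the log domain, so two speakers saying the same sentence with different pitch registers produce similar conditioning tracks.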

arnabdas8901 commented 1 year ago

Thanks a lot for your detailed and prompt reply.