Closed: arnabdas8901 closed this issue 1 year ago
Hi, it is hard to ensure that the speaker encoder encodes ONLY speaker information, because it is a global vector that may contain all kinds of global, time-invariant features, such as speaker, channel, and environment characteristics. In the paper, however, we assume that the speaker encoder encodes speaker information, and the results in Table 3 (s-accuracy) verify that the representation extracted from the speaker encoder does contain speaker information. As for the pitch encoder, it is not used; we simply feed the original pitch values as inputs to the VC system.
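To make that data flow concrete, here is a minimal PyTorch-style sketch, with hypothetical module and variable names rather than the repository's actual code, of a speaker encoder that pools over time into a single global vector and a decoder that consumes the original frame-level pitch values directly, with no pitch encoder in between:

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Maps a mel-spectrogram to a single global (time-invariant) vector.
    Because the output is pooled over time, it can absorb any global factor
    (speaker, channel, environment), not speaker identity alone."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )

    def forward(self, mel):            # mel: (B, n_mels, T)
        h = self.net(mel)              # (B, dim, T)
        return h.mean(dim=-1)          # (B, dim) global embedding

class Decoder(nn.Module):
    """Reconstructs mel frames from content features, the global speaker
    embedding, and the ORIGINAL frame-level pitch values (no pitch encoder)."""
    def __init__(self, content_dim=64, spk_dim=256, n_mels=80):
        super().__init__()
        # +1 input channel for the raw F0 value appended to every frame
        self.proj = nn.Conv1d(content_dim + spk_dim + 1, n_mels, kernel_size=1)

    def forward(self, content, spk_emb, f0):
        # content: (B, content_dim, T), spk_emb: (B, spk_dim), f0: (B, T)
        T = content.size(-1)
        spk = spk_emb.unsqueeze(-1).expand(-1, -1, T)    # broadcast over time
        x = torch.cat([content, spk, f0.unsqueeze(1)], dim=1)
        return self.proj(x)                               # (B, n_mels, T)
```

In this setup, the reconstruction loss together with the MI-minimization terms between encoder outputs (as described in the paper) discourages the content representation from duplicating what the speaker embedding carries; nothing explicitly forbids the speaker embedding from also capturing other global factors, which is exactly the caveat above.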
Thanks a lot for your detailed and prompt reply.
Dear Team, thank you very much for your paper and your code. I have a query about the representations. I understand that MI minimization makes the representations distinct from each other so that they embed separate information. However, it is not clear to me how it ensures that the speaker encoder encodes only the speaker feature and the pitch encoder encodes only the pitch feature. Is there any other loss besides the ones mentioned in the paper? Sorry if I am missing something; I would be grateful if you could shed a bit of light on this.
Thanks, Arnab