auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

Differences in Architecture Between Code and Paper #100

Open · taubaaron opened this issue 2 years ago

taubaaron commented 2 years ago

Hey, firstly, thank you very much for sharing your work; it really is interesting.

I have a few questions regarding the implementation of the paper "AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss":

  1. In Section 3.1, "Problem Formulation", it is explained (and shown in Figure 1) that the output of the speaker encoder (whose input is the target speaker's utterance) is fed directly into the decoder, i.e. after the bottleneck. In the code, on the other hand, the speaker encoder's output appears to be concatenated with the mel-spectrogram and fed into the content encoder, rather than being injected later, after the bottleneck.

  2. Again in Figure 1, it is shown that during training the style embedding comes from the same speaker, but from a different file/segment. Is that implemented in the code too? It didn't seem like it, but I might be missing something.

  3. In Table 1 (page 8), you present speaker-classification results on the output of the content encoder. Is there a way I can reproduce those results? (Can you share that part of the code too?)

Thanks very much, Aaron

auspicious3000 commented 2 years ago
  1. The speaker emb is also concatenated with the encoder output before being fed into the decoder (see the sketch below).
  2. Yes, the speaker emb is extracted from the same speaker, but most likely from a different utterance.
  3. Just train a classifier on the encoder output (a rough sketch follows below).
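
For reference, here is a minimal sketch of the two concatenation points discussed above. The shapes and the `nn.Linear` stand-ins are illustrative placeholders, not the repo's actual layers (the real content encoder and decoder are recurrent/convolutional stacks):

```python
import torch
import torch.nn as nn

B, T, N_MELS, EMB, BOTTLENECK = 2, 128, 80, 256, 64

content_encoder = nn.Linear(N_MELS + EMB, BOTTLENECK)  # stand-in for the content encoder
decoder = nn.Linear(BOTTLENECK + EMB, N_MELS)          # stand-in for the decoder

mel = torch.randn(B, T, N_MELS)      # mel-spectrogram of the source utterance
spk_src = torch.randn(B, EMB)        # speaker embedding of the source speaker
spk_trg = torch.randn(B, EMB)        # speaker embedding of the target speaker

# Concatenation #1 (the one the issue asks about): the speaker embedding is
# broadcast over time and concatenated with the mel-spectrogram before the
# content encoder.
spk = spk_src.unsqueeze(1).expand(-1, T, -1)             # (B, T, EMB)
codes = content_encoder(torch.cat((mel, spk), dim=-1))   # (B, T, BOTTLENECK)

# Concatenation #2 (matching Figure 1): the target speaker embedding is also
# concatenated with the bottleneck codes before the decoder.
spk = spk_trg.unsqueeze(1).expand(-1, T, -1)             # (B, T, EMB)
mel_out = decoder(torch.cat((codes, spk), dim=-1))       # (B, T, N_MELS)
```

And a rough sketch of the probe from item 3, i.e. training a small speaker classifier on the content codes (continuing from the placeholders above; the time-average pooling and the hyperparameters are assumptions, not the paper's exact setup):

```python
n_speakers = 20                                  # assumed number of training speakers
probe = nn.Linear(BOTTLENECK, n_speakers)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(codes, speaker_ids):
    # codes: (B, T, BOTTLENECK) content-encoder output; speaker_ids: (B,) long tensor.
    # detach() keeps the encoder frozen, so only the probe is trained.
    logits = probe(codes.detach().mean(dim=1))   # average over time, then classify
    loss = loss_fn(logits, speaker_ids)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

speaker_ids = torch.randint(0, n_speakers, (B,))
print(probe_step(codes, speaker_ids))
```

If the classifier barely beats chance on the content codes, the bottleneck has stripped most of the speaker information, which (as I understand it) is what the Table 1 experiment checks.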