auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

Why are original speaker embeddings concatenated with the original speaker's spectrogram? #30

Open nkcdy opened 4 years ago

nkcdy commented 4 years ago

Theoretically, the original speaker embedding information is already contained in the spectrogram, and the network should automatically squeeze it out after convergence. Why is the original speaker embedding still needed?

auspicious3000 commented 4 years ago

So that the encoder does not need to learn that information from the spectrogram.

nkcdy commented 4 years ago

@auspicious3000 I don't quite understand what you mean. Is the "encoder" you mentioned the content encoder or the speaker encoder? Can you please explain in more detail? I can't find the answer in the paper. Actually, I'm wondering what would happen if the original speaker embedding were discarded from the content encoder's input...

auspicious3000 commented 4 years ago

The content encoder. Without the speaker emb, it is harder for the encoder to learn that information from the spectrogram. Since you already have that info, just give it to the encoder, so that it does not need to learn it from the spectrogram.
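Concretely, the concatenation being discussed can be sketched as follows. This is a minimal PyTorch sketch; the tensor names and shapes are illustrative assumptions, not the repo's exact code:

```python
import torch

# A minimal sketch of the concatenation under discussion; tensor names
# and shapes are illustrative assumptions, not the repo's exact code.
mel = torch.randn(8, 80, 128)   # (batch, n_mels, frames) mel-spectrogram
emb_org = torch.randn(8, 256)   # (batch, emb_dim) source-speaker embedding

# Broadcast the embedding across time and concatenate it channel-wise
# with every frame, so the content encoder receives the speaker identity
# directly instead of having to infer it from the spectrogram.
emb_expanded = emb_org.unsqueeze(-1).expand(-1, -1, mel.size(-1))  # (8, 256, 128)
encoder_input = torch.cat((mel, emb_expanded), dim=1)              # (8, 336, 128)
```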

nkcdy commented 4 years ago

@auspicious3000 It's still not intuitive to me. Maybe I should read some related papers on the theory behind this. Do you have any recommended papers?

auspicious3000 commented 4 years ago

There must be papers describing this technique, but I don't know any of them. It should be very simple and intuitive. For example, suppose you need to solve A and B; if I give you the answer to B, you only need to solve A. That's it.

nkcdy commented 4 years ago

@auspicious3000 Yes, viewed from the whole network, you are right. Suppose B is emb_trg and A is the original content; that is intuitive. My question is why emb_org still needs to be concatenated with the original spectrogram during training, since emb_org and emb_trg are identical at training time. My thought is that the content encoder could still learn to generate only the content information even if no emb_org were concatenated with the original spectrogram.

auspicious3000 commented 4 years ago

My explanation is for the encoder, and I understood your question perfectly. Without feeding emb_org, the encoder can still learn to disentangle content and identity, but it is easier if the identity is already given.
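To make this exchange concrete: during training AutoVC reconstructs the same utterance, so emb_org and emb_trg are the same vector; only at conversion time do they differ. Below is a hedged, runnable sketch of both passes with shape-only stand-ins for the speaker encoder, content encoder, and decoder. The real modules are conv/LSTM stacks, and all names and dimensions here are illustrative assumptions:

```python
import torch
import torch.nn as nn

B, N_MELS, T, EMB = 4, 80, 128, 256  # illustrative sizes, not AutoVC's exact ones

# Shape-only stand-in for the speaker encoder (the real one is a
# pretrained speaker-verification network): pool over time, then project.
speaker_encoder = nn.Sequential(
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(N_MELS, EMB)
)

def content_encoder(mel, emb):
    # The speaker embedding is broadcast over time and concatenated
    # channel-wise with the mel, as sketched earlier in the thread.
    emb_t = emb.unsqueeze(-1).expand(-1, -1, mel.size(-1))
    return torch.cat((mel, emb_t), dim=1)  # stand-in "content codes"

def decoder(codes, emb):
    emb_t = emb.unsqueeze(-1).expand(-1, -1, codes.size(-1))
    return torch.cat((codes, emb_t), dim=1)[:, :N_MELS]  # stand-in mel output

mel_src = torch.randn(B, N_MELS, T)

# Training: self-reconstruction, so the "source" and "target" embeddings
# are the same vector, and the loss is plain reconstruction.
emb_org = speaker_encoder(mel_src)
recon = decoder(content_encoder(mel_src, emb_org), emb_org)
loss = ((recon - mel_src) ** 2).mean()

# Conversion: only here do the two embeddings differ -- content codes from
# the source utterance, identity from the target speaker.
mel_trg = torch.randn(B, N_MELS, T)
emb_trg = speaker_encoder(mel_trg)
converted = decoder(content_encoder(mel_src, emb_org), emb_trg)
```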

nkcdy commented 4 years ago

@auspicious3000 Here is where I got confused. The content encoder is just an encoder, just a network, and a network can generate any possible output unless it gets some guidance. In my thinking, the "guidance" should come from the backpropagation algorithm, not from the input. Now I'm starting to understand why deep learning is sometimes called "alchemy" in China. The emb_org here is very similar to an ingredient called "YaoYin" in ancient Chinese alchemy (don't take it seriously, just a joke). Maybe my issue is my way of thinking: I am used to thinking about everything in a linear way, while a deep network is not a linear system.

ruclion commented 3 years ago

> @auspicious3000 Here is where I got confused. The content encoder is just an encoder, just a network, and a network can generate any possible output unless it gets some guidance. [...]

Thank you, and thanks to the author. I think this idea of concatenating the speaker embedding to the mels is an outstanding innovation:

  • When there is not much data, or we do not train to convergence, this idea is useful.
  • Although the speaker embedding could in principle be recovered from the mels by the NN encoder, this idea works like a "structured output NN", or like guided attention in Tacotron.

Both of you are right~ And this is really a good idea.

I suddenly thought of an example: multi-speaker ASR: