auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License
987 stars 205 forks

The output encoder #12

Open nkcdy opened 5 years ago

nkcdy commented 5 years ago

It seems that the output encoder should be an extra module with the same structure and the same weights as the input encoder, but it is very difficult to get my training to converge. Correct me if I am wrong.

auspicious3000 commented 5 years ago

The decoder and encoder are different; they don't share weights.

nkcdy commented 5 years ago

What I mean is the content encoder for the output signal, not the decoder.

auspicious3000 commented 5 years ago

Yes, just feed the reconstruction back into the encoder.
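For anyone else confused by this: the answer above means there is no second encoder with copied weights; the same content encoder instance is simply applied a second time to the reconstruction, and an L1 penalty ties the two content codes together. A minimal NumPy sketch of that loss structure (function names and the `lambda_content` weight are illustrative, not from the repo; the actual implementation uses PyTorch modules and a downsampled content code):

```python
import numpy as np

def autovc_losses(x, x_hat, encoder, lambda_content=1.0):
    """AutoVC-style training loss: L2 reconstruction plus L1 content loss.

    The SAME encoder is applied to both the input x and its
    reconstruction x_hat -- no extra weight-copied module is needed.
    """
    recon = ((x_hat - x) ** 2).mean()        # reconstruction term
    code_real = encoder(x)                    # content code of the input
    code_recon = encoder(x_hat)               # feed reconstruction back in
    content = np.abs(code_real - code_recon).mean()
    return recon + lambda_content * content
```

With an identity encoder as a stand-in, a perfect reconstruction gives zero loss, and any mismatch makes the loss positive, which is the behavior the real training loop relies on.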

liveroomand commented 5 years ago

@nkcdy Have you solved your speaker feature extraction problem? How did you do it?

liveroomand commented 5 years ago

@nkcdy I am also implementing this paper and have run into many of the same problems as you. Can I communicate with you? I've been able to do voice conversion, but there's still a certain amount of background noise.

Trebolium commented 3 years ago

@liveroomand @nkcdy Did either of you finally figure out what the secret sauce is for training a version that converges to 0.0001 and yields audio of a quality similar to the pretrained model? I get very noisy conversions at 100k iterations, but after 1M iterations I get conversions of similar quality to the examples produced by the pretrained network provided. I have also been using the training code that was uploaded to the repo 3 months ago.