auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

unable to apply voice conversion to long files using my trained speaker embedding #57

Open xanguera opened 3 years ago

xanguera commented 3 years ago

Hi, I am attempting zero-shot voice conversion using only a few audio sentences of a target speaker. I build a speaker embedding for this speaker using make_spect.py and make_metadata.py (originally intended for training) and extract the embedding from the resulting train.pkl file. I do the same for the source speaker. As a comparison, I also perform VC between two of the speakers provided in the metadata.pkl file.
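
For context, here is a minimal sketch of how one might read those embeddings back out of train.pkl. The entry layout ([speaker_id, embedding, utterance paths, ...]) is an assumption based on make_metadata.py, and the speaker ids below are hypothetical:

```python
import pickle
import numpy as np

# Sketch: pull per-speaker embeddings out of the train.pkl written by
# make_metadata.py. Assumes each entry is [speaker_id, embedding, paths...],
# which is how this repo's metadata pickles appear to be laid out.
with open('spmel/train.pkl', 'rb') as f:
    metadata = pickle.load(f)

embeddings = {entry[0]: np.asarray(entry[1]) for entry in metadata}
emb_trg = embeddings['target_speaker']  # hypothetical speaker ids
emb_org = embeddings['source_speaker']
print(emb_trg.shape)  # the pretrained speaker encoder gives 256-dim vectors
```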

I then apply these embeddings to perform VC by modifying the conversion.py code (see the sketch after the list below), and things start to fall apart. Here is what I see:

  • If I use any of the embeddings in metadata.pkl as target and source, the resulting audio sounds good, whether I convert a short (2 s) or a long (7 s) file.
  • If I use my computed embeddings, the resulting audio sounds good if it is short, but for long files I get garbled audio after 2-3 seconds (sometimes earlier).
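
A minimal sketch of that conversion step, assuming the Generator class, the autovc.ckpt checkpoint, and the pad-to-multiple-of-32 convention from this repo's conversion code; x_org is the source mel-spectrogram from make_spect.py and emb_org / emb_trg are the embeddings read from the pickles above:

```python
import numpy as np
import torch
from model_vc import Generator

def pad_seq(x, base=32):
    # The generator downsamples in time, so pad the length to a multiple of 32.
    len_out = int(base * np.ceil(x.shape[0] / base))
    len_pad = len_out - x.shape[0]
    return np.pad(x, ((0, len_pad), (0, 0)), 'constant'), len_pad

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
G = Generator(32, 256, 512, 32).eval().to(device)
g_checkpoint = torch.load('autovc.ckpt', map_location=device)
G.load_state_dict(g_checkpoint['model'])

x_pad, len_pad = pad_seq(x_org)
uttr_org = torch.from_numpy(x_pad[np.newaxis, :, :]).float().to(device)
emb_org_t = torch.from_numpy(emb_org[np.newaxis, :]).float().to(device)
emb_trg_t = torch.from_numpy(emb_trg[np.newaxis, :]).float().to(device)

with torch.no_grad():
    _, x_identic_psnt, _ = G(uttr_org, emb_org_t, emb_trg_t)

# Converted mel-spectrogram, trimmed back to the original length.
uttr_trg = x_identic_psnt[0, 0, :, :].cpu().numpy()
if len_pad > 0:
    uttr_trg = uttr_trg[:-len_pad, :]
```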

Has anyone experienced this? The author says the system was trained on short audios, but would that explain this behavior, especially given that conversions using the embeddings in metadata.pkl always sound good regardless of length? Have the embeddings in metadata.pkl been computed in exactly the same way as the embeddings I am now computing? (Note that I have tried disabling the random noise added in the make_metadata.py script, with the same results.)

Thanks!

zzw922cn commented 3 years ago

Hi, what's the reconstruction loss of your converged model? I want to know when I can reproduce the in-domain voice conversion results. Thank you~~

ruclion commented 3 years ago

> Hi, I am attempting zero-shot voice conversion using only a few audio sentences of a target speaker. [... full comment quoted from above]

Maybe use your own speaker's dataset to fine-tune AutoVC's content encoder and decoder?

ruclion commented 3 years ago

> hi, what's the reconstruction loss of your converged model? [... quoted from above]

He just uses the speaker encoder to get speaker features and applies them to the pretrained AutoVC model, so he probably didn't train anything~

MuyangDu commented 3 years ago

Same issue here. I just tried the pretrained model (downloaded from the repo) without any fine-tuning. I used some clean speech audio from a female speaker outside the VCTK dataset; let's call her speaker A. Here is the procedure:

  • audios of A -> pretrained speaker encoder -> speaker embedding of A
  • audios of p227 in VCTK -> pretrained speaker encoder -> speaker embedding of p227
  • (speaker embedding of A, speaker embedding of p227, spectrogram of p227) -> pretrained autovc -> pretrained wavenet -> generated speech audio of A
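
A minimal sketch of the speaker-encoder step above, following make_metadata.py from this repo; the D_VECTOR module, the 3000000-BL.ckpt checkpoint name, and my_spect (an 80-bin mel-spectrogram from make_spect.py, at least 128 frames long) are assumptions based on the repo's pretrained downloads:

```python
import numpy as np
import torch
from collections import OrderedDict
from model_bl import D_VECTOR

# Load the pretrained speaker encoder, as in make_metadata.py.
C = D_VECTOR(dim_input=80, dim_cell=768, dim_emb=256).eval().cuda()
c_checkpoint = torch.load('3000000-BL.ckpt')
new_state_dict = OrderedDict()
for key, val in c_checkpoint['model_b'].items():
    new_state_dict[key[7:]] = val  # strip the 'module.' prefix
C.load_state_dict(new_state_dict)

# Average the embedding over several random 128-frame crops,
# mirroring what make_metadata.py does per speaker.
len_crop = 128
embs = []
for _ in range(10):
    left = np.random.randint(0, my_spect.shape[0] - len_crop)
    melsp = torch.from_numpy(my_spect[np.newaxis, left:left + len_crop, :]).float().cuda()
    with torch.no_grad():
        emb = C(melsp)
    embs.append(emb.squeeze().cpu().numpy())
emb_A = np.mean(embs, axis=0)  # the "speaker embedding of A"
```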

However, the generated speech of A gets garbled after 2-3 seconds. Here is the generated audio of A: generated.wav.zip. Here is the original audio of p227: p227_005.wav.zip
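
For completeness, the "pretrained wavenet" step corresponds to the vocoder code shipped with the repo; a minimal sketch, assuming the synthesis module, the checkpoint_step001000000_ema.pth checkpoint, and a converted mel-spectrogram uttr_trg (soundfile is used here to write the wav):

```python
import soundfile as sf
import torch
from synthesis import build_model, wavegen

# WaveNet vocoder step, following the repo's vocoder example.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = build_model().to(device)
checkpoint = torch.load('checkpoint_step001000000_ema.pth', map_location=device)
model.load_state_dict(checkpoint['state_dict'])

# Autoregressive synthesis (slow): mel-spectrogram -> waveform at 16 kHz.
waveform = wavegen(model, c=uttr_trg)
sf.write('generated.wav', waveform, 16000)
```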

From the generated audio we can see that the speaker identity is successfully converted to A, but the actual speech content is lost.

But if I convert p227 to p225 (both of them are VCTK speakers) using exactly the same procedure described above and exactly the same pretrained models, it works fine. (Shorter audios perform better than long ones, but at least the content is correct.)

From the paper, I found this: "... as long as seen speakers are included in either side of the conversions, the performance is comparable...". A is an unseen speaker, and I am not sure whether p227 was seen during training.

So, here are some guesses:

  1. Since the generated audio sounds like A, the speaker encoder should be working fine.
  2. Since the p227 -> p225 conversion is fine in content, the content encoder should be working fine.
  3. But the content is lost in the generated audio of A, so maybe the decoder is overfitting to VCTK (VCTK content + VCTK speaker embeddings)? If we replace the decoder input with (VCTK content + unseen speaker embedding), it fails.

I haven't tried fine-tuning yet. Just some thoughts. Any ideas, guys? @xanguera @ruclion @auspicious3000