auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

Issues with conversion of VCTK speakers using pre-trained model #49

Open gustxsr opened 4 years ago

gustxsr commented 4 years ago

I am training a model on my own data to make conversions between two speakers, using the training code you mailed me. I have tried various approaches to produce good conversions and, although I have gotten some average-sounding output (with robotic crackles), it is not as good as the samples on the audio demo website.

As part of debugging my training, I am using the pre-trained model (autovc.ckpt) to produce conversions between speakers in the VCTK dataset (the dataset the model was trained on). Although I manage to get relatively good outputs using samples from the speakers in the metadata.pkl file as the source and target speaker/speech, when I add a sample from another VCTK speaker (p240 or p260, for example) to perform a conversion, the quality of the output is poor. Can you give some pointers as to why this could be happening? Is the model trained on the entire VCTK dataset, or only a portion of it?

Here is some more information on how I generate the metadata.pkl file used for the conversions. To generate the speaker embedding of a recording, I use the pre-trained speaker encoder that comes with the training code ("3000000-BL.ckpt") with len_crop = 128. To generate the spectrogram, I use the code in the make_spect.py file that also comes with the training code, and I leave in the addition of some random noise (it gave better outputs that way). Thank you beforehand!
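In case it helps, this is roughly how I put metadata.pkl together. It is only a minimal sketch: I am assuming the speaker encoder class and checkpoint layout from the training code (model_bl.D_VECTOR with the 'model_b' key inside 3000000-BL.ckpt) and that each entry is [name, embedding, spectrogram] as in conversion.ipynb; the p240/p260 paths are placeholders.

```python
# Hypothetical sketch of building a conversion metadata.pkl:
# one entry per speaker as [name, speaker_embedding, spectrogram].
# Class name, constructor arguments and checkpoint key are taken from my copy
# of the training code and may differ in yours.
import pickle
from collections import OrderedDict

import numpy as np
import torch
from model_bl import D_VECTOR  # speaker encoder shipped with the training code

encoder = D_VECTOR(dim_input=80, dim_cell=768, dim_emb=256).eval().cuda()
ckpt = torch.load('3000000-BL.ckpt')
state = OrderedDict((k[7:], v) for k, v in ckpt['model_b'].items())  # strip 'module.'
encoder.load_state_dict(state)

len_crop = 128
metadata = []
for name, spect_path in [('p240', 'spmel/p240/p240_001.npy'),
                         ('p260', 'spmel/p260/p260_001.npy')]:
    spect = np.load(spect_path)  # mel spectrogram produced by make_spect.py
    # Random crop of len_crop frames; utterances shorter than len_crop
    # would need padding instead.
    left = np.random.randint(0, max(1, spect.shape[0] - len_crop))
    crop = spect[np.newaxis, left:left + len_crop, :].astype(np.float32)
    with torch.no_grad():
        emb = encoder(torch.from_numpy(crop).cuda()).squeeze(0).cpu().numpy()
    metadata.append([name, emb, spect])

with open('metadata.pkl', 'wb') as f:
    pickle.dump(metadata, f)
```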

auspicious3000 commented 4 years ago

The pretrained model is only trained on a small set of speakers, which may not generalize well to other speakers.

You can use one-hot embedding if you are not doing zero-shot conversion.

Model tuning, audio quality, languages, vocoder, etc., all contribute to the conversion quality. Also, the current architecture is not optimized for the voice conversion task. You can make changes to the architecture as long as it follows the concept of our paper.

We have also made improvements to our project and published them as follow-up works. You might find them useful.

cjw414 commented 4 years ago

@auspicious3000 Can you provide us with the list of speakers that were used to train the pre-trained model?

mbronckers commented 4 years ago

@Jungwon-Chang you can check the metadata pickle file: p225, p228, p256, p270
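For anyone who wants to check this themselves, something like the following prints the speaker IDs stored in the released metadata.pkl (assuming each entry is a list whose first element is the speaker name):

```python
# List the speaker IDs stored in metadata.pkl.
import pickle

with open('metadata.pkl', 'rb') as f:
    metadata = pickle.load(f)

print([entry[0] for entry in metadata])  # e.g. ['p225', 'p228', 'p256', 'p270']
```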

cjw414 commented 4 years ago

@mbronckers I meant the whole list of speakers used for training. Since the paper noted a 9:1 train/test split, I assumed there would be 90+ speakers used during training.

I emailed the author, but got a response saying that he also does not have access to that information.

Trebolium commented 3 years ago

@Jungwon-Chang Please correct me if I'm wrong (as I'm dying to know why my model won't produce good quality speech), but I think the paper describes that for many-to-many conversion the model is trained on 20 speakers, and the split among these speakers' utterances is 9:1. For zero-shot, it's trained on 40 speakers. Hope I'm not missing anything.

cjw414 commented 3 years ago

@Trebolium I have to review the paper again, but I don't think the model was trained on just 20 speakers. I think it used the whole 109 (the exact number could be incorrect) speakers from VCTK and split that 9:1.

Trebolium commented 3 years ago

It is the utterances from each speaker that are split 9:1 - let's say there are 800 utterances per speaker, then the model would be trained on 720 utterances, with 80 left for the test set. Also bear in mind this split is only relevant for many-to-many conversions. The paper says that 20 speakers were used for many-to-many conversion, or 40 speakers for zero-shot conversion. I made this mistake as well! A rough sketch of that kind of split is below.
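Purely for illustration, a per-speaker 9:1 utterance split could look like this. The spmel/<speaker>/<utterance>.npy layout is an assumption based on make_spect.py's output; this is not the authors' exact split script.

```python
# Illustrative per-speaker 9:1 utterance split.
import os
import random

def split_speaker(spmel_dir, speaker, train_ratio=0.9, seed=0):
    utts = sorted(os.listdir(os.path.join(spmel_dir, speaker)))
    random.Random(seed).shuffle(utts)
    n_train = int(len(utts) * train_ratio)
    return utts[:n_train], utts[n_train:]  # e.g. 720 train / 80 test for 800 utterances

train_utts, test_utts = split_speaker('spmel', 'p225')
```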

ghost commented 3 years ago

In the code, where is the difference between zero-shot and many-to-many?

Trebolium commented 3 years ago

The code is a proof-of-concept of the zero-shot method. You would have to write the many-to-many yourself using one-hot encodings instead of speaker embeddings.
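For reference, a rough sketch of what that swap could look like: a one-hot vector over your (closed) set of training speakers used in place of the 256-dim d-vector, with the generator's speaker-embedding dimension set to the number of speakers. This is my interpretation of the suggestion, not code from the repo.

```python
# Hypothetical one-hot speaker "embedding" for a closed-set many-to-many setup.
# The one-hot vector simply replaces the d-vector produced by the speaker
# encoder; the generator must be built with dim_emb = len(speakers).
import numpy as np

speakers = ['p225', 'p228', 'p256', 'p270']  # your training speakers

def one_hot_embedding(speaker_id, speakers=speakers):
    emb = np.zeros(len(speakers), dtype=np.float32)
    emb[speakers.index(speaker_id)] = 1.0
    return emb

emb_src = one_hot_embedding('p225')  # source speaker embedding
emb_trg = one_hot_embedding('p256')  # target speaker embedding
```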
