auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License
1.01k stars · 207 forks

Bad conversion quality after retraining #33

Closed: kvnsq closed this issue 4 years ago

kvnsq commented 5 years ago

Hi, first of all thanks for the great work on the AutoVC system. I have tried to replicate it, but could not achieve nearly the same quality as the pretrained system. I use the same pre-processing for the mel-spectrograms as discussed in issue #4 and have trained the system with the same 20 VCTK speakers as the experiment in the paper (additionally with 8 speakers from the VCC data set; results were similar when they were omitted). I also used one-hot encodings instead of speaker embeddings from an encoder.
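
For reference, the mel extraction has this general shape (a rough sketch; the sample rate, FFT/hop sizes, mel range, and normalization constants below are placeholders rather than necessarily the exact values from issue #4):

import librosa
import numpy as np

def extract_mel(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=80,
                fmin=90, fmax=7600, min_db=-100.0, ref_db=16.0):
    # Load audio, compute a linear magnitude spectrogram, and project onto a mel basis.
    y, _ = librosa.load(wav_path, sr=sr)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)
    mel = np.dot(mel_basis, spec)
    # Convert to dB and normalize to [0, 1]; a mismatch here relative to the
    # pretrained model's preprocessing is a common cause of blurry conversions.
    mel_db = 20.0 * np.log10(np.maximum(1e-5, mel)) - ref_db
    return np.clip((mel_db - min_db) / -min_db, 0.0, 1.0).T  # (num_frames, n_mels)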

I trained for about 300,000 steps using Adam with default parameters and a learning rate of 0.0001; the training loss is about 6.67e-3 and the validation loss is about 0.01 and rising. I've also tried other learning rates (0.001, 0.0005) with no improvement in quality. The converted mel-spectrograms are still blurry and produce a low-quality, robotic voice. In comparison, the converted mel-spectrograms of the supplied AutoVC model are much sharper and produce a more natural voice, even when used with Griffin-Lim. Here are the mel-spectrograms of my retrained model and the model from the repo:

[Mel-spectrograms: retrained model (p270-p228-own) vs. supplied model (p270-p228-paper)]

Here is a minimal example of the loss and training loop I use. I can also provide more of my code if needed.

import torch.nn.functional as F  # torch, the generator, args, device, etc. are set up elsewhere

def train_step(mel_spec_batch, embeddings_batch, generator, optimizer,
               weight_mu_zero_rec: float, weight_lambda_content: float):
    optimizer.zero_grad()

    mel_spec_batch_exp = mel_spec_batch.unsqueeze(1) # (batch_size=2, 1, num_frames=128, num_mels=80)
    mel_outputs, mel_outputs_postnet, content_codes_mel_input = generator(mel_spec_batch,
                                                                          embeddings_batch,
                                                                          embeddings_batch)
    # Returns content codes with self.encoder without using the decoder and postnet a second time
    content_codes_gen_output = generator.get_content_codes(mel_outputs_postnet, embeddings_batch)

    rec_loss = F.mse_loss(input=mel_outputs_postnet, target=mel_spec_batch_exp, reduction="mean")
    rec_0_loss = F.mse_loss(input=mel_outputs, target=mel_spec_batch_exp, reduction="mean")
    content_loss = F.l1_loss(input=content_codes_gen_output, target=content_codes_mel_input, reduction="mean")
    total_loss = rec_loss + weight_mu_zero_rec * rec_0_loss + weight_lambda_content * content_loss

    total_loss.backward()
    optimizer.step()

# Training loop
for epoch in range(start_epoch + 1, args[FLAGS.MAX_NUM_EPOCHS] + 1):
    generator.train()
    # Iterate over mel-spectrogram slices and the indices of their speakers
    for step_idx, (mel_spec_batch, speaker_idx_batch) in enumerate(train_set_loader):
        # Look up the speaker embeddings for the speakers of the mel-spectrograms
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)].to(device)
        train_step(mel_spec_batch.to(device), spkr_embeddings, generator, optim,
                   weight_mu_zero_rec=args[FLAGS.AUTO_VC_MU_REC_LOSS_BEFORE_POSTNET],  # == 1.0
                   weight_lambda_content=args[FLAGS.AUTO_VC_LAMBDA_CONTENT_LOSS])      # == 1.0
    # The rest computes the validation loss, resynthesizes utterances, saves the model every n epochs, etc.
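
For completeness, the validation loss is computed along these lines (a sketch that reuses the names from the code above; val_set_loader is a hypothetical DataLoader over held-out mel-spectrogram slices, and only the post-net reconstruction term is shown):

generator.eval()
with torch.no_grad():
    val_losses = []
    for mel_spec_batch, speaker_idx_batch in val_set_loader:
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)].to(device)
        mel_spec_batch = mel_spec_batch.to(device)
        # Same reconstruction objective as in training, without gradient updates
        _, mel_outputs_postnet, _ = generator(mel_spec_batch, spkr_embeddings, spkr_embeddings)
        val_losses.append(F.mse_loss(mel_outputs_postnet, mel_spec_batch.unsqueeze(1)).item())
    val_loss = sum(val_losses) / len(val_losses)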

Does anyone have an idea what is wrong with my re-implementation, or has anyone managed to reimplement the system with good quality?

Thanks a lot in advance.

auspicious3000 commented 5 years ago

@dodobyte Someone reproduced the results and published them at ICLR. Unfortunately, those who successfully reproduce the results don't post here.

auspicious3000 commented 5 years ago

@kvnsq Is your dataloader randomized?

kvnsq commented 5 years ago

> @kvnsq Is your dataloader randomized?

No, it wasn't. Thanks for the tip, I'll try that. I had divided the mel-spectrograms into fixed 128-frame slices, shuffled them, and iterated over them. Following your suggestion, I now iterate over shuffled mel-spectrograms and randomly crop them on the fly. Do you also randomly sample mel-spectrograms from the training set in each training step, such that the same samples could appear in two consecutive mini-batches?
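
Concretely, the on-the-fly cropping now looks roughly like this (a sketch; mel_specs is a list of full-length (num_frames, 80) arrays and speaker_indices their speaker indices):

import random

import torch
from torch.utils.data import Dataset

class RandomCropMelDataset(Dataset):
    # Returns a random 128-frame crop of a full utterance each time it is indexed.
    def __init__(self, mel_specs, speaker_indices, crop_len=128):
        self.mel_specs = mel_specs
        self.speaker_indices = speaker_indices
        self.crop_len = crop_len

    def __len__(self):
        return len(self.mel_specs)

    def __getitem__(self, idx):
        mel = torch.as_tensor(self.mel_specs[idx], dtype=torch.float32)
        if mel.shape[0] > self.crop_len:
            start = random.randint(0, mel.shape[0] - self.crop_len)
            mel = mel[start:start + self.crop_len]
        else:
            # Pad utterances shorter than the crop length with zeros at the end
            mel = torch.nn.functional.pad(mel, (0, 0, 0, self.crop_len - mel.shape[0]))
        return mel, self.speaker_indices[idx]

This is used with a DataLoader with shuffle=True, so both the utterance order and the crop positions change every epoch.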

In my previous runs I noticed that I could increase the batch size to 32 without any degradation in the loss compared to a batch size of 2. Because training is much faster that way, I reran the experiments with on-the-fly random cropping, but couldn't see an improvement in quality. I'm going to retrain the model with a batch size of 2 and a learning rate of 0.0001 overnight and report back if I see any change in quality.

youngsuenXMLY commented 4 years ago

[quoting @kvnsq's original post in full]

Hi, I didn't find any code for the style encoder. Did you implement the style encoder yourself?

auspicious3000 commented 4 years ago

Please refer to #24, or simply use one-hot embeddings.
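
A minimal sketch of the one-hot option (assuming the embedding table is indexed by speaker index, as in the training code above, and that the one-hot codes are zero-padded up to whatever dim_emb the Generator expects):

import torch

num_speakers = 20   # e.g. the 20 VCTK training speakers
dim_emb = 256       # must match the dim_emb the Generator was built with

# One row per speaker: a one-hot code zero-padded up to the embedding dimension
speaker_embedding_mat = torch.zeros(num_speakers, dim_emb)
speaker_embedding_mat[:, :num_speakers] = torch.eye(num_speakers)

# Then, as above: spkr_embeddings = speaker_embedding_mat[speaker_idx_batch]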

zonerby commented 4 years ago

@kvnsq Did you manage to improve the training of the model? I have the same problem: the retrained model produces poor-quality mel-spectrograms. I've tried 20/40 VCTK speakers, batch sizes from 2 to 32, and various frameworks for generating the mel-spectrograms.

kvnsq commented 4 years ago

@zonerby No, I gave up on this model and am currently working on other VC models. I might try to retrain the AutoVC model with other NN architectures in the future, but that has lower priority for me right now. If I do retrain it and see some improvements, I'll let you know.

himajin2045 commented 4 years ago

I finally got this model working on a Chinese corpus, with quite good output quality even when converting from one unseen speaker to another unseen speaker. I used 120 speakers from the corpus with 120 utterances per speaker, a learning rate of 1e-4, and a batch size of 4. The content embedding is downsampled by a factor of 16 (i.e. Generator(32, 256, 512, 16)); the default factor of 32 did not work for me. I used a GE2E speaker encoder trained on a combination of several Chinese corpora with about 2,800 speakers in total, and l1_loss instead of mse_loss for all three losses. The model was trained for 370k steps and reached a training loss of about 0.045.
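
In code, these changes relative to the defaults amount to roughly the following (a sketch assuming the Generator class from this repo's model_vc.py; the loss names mirror the train_step posted earlier in this thread, and device/optimizer setup is omitted):

import torch.nn.functional as F
from model_vc import Generator  # Generator(dim_neck, dim_emb, dim_pre, freq) in this repo

# freq lowered from the default 32 to 16, so the content code is downsampled by 16
generator = Generator(32, 256, 512, 16).to(device)

# Inside the training step: L1 instead of MSE for all three terms, all weighted equally
rec_loss = F.l1_loss(mel_outputs_postnet, mel_spec_batch_exp)
rec_0_loss = F.l1_loss(mel_outputs, mel_spec_batch_exp)
content_loss = F.l1_loss(content_codes_gen_output, content_codes_mel_input)
total_loss = rec_loss + rec_0_loss + content_loss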

This is what one of the training samples looks like; I also plot the downsampled content embedding alongside the upsampled one:

[Screenshot: mel-spectrogram of a training sample with the downsampled and upsampled content embeddings]

This is a conversion from an unseen male speaker to an unseen female speaker. I plot the source utterance, the converted utterance, one of the target speaker's utterances (the target speaker embedding is averaged over 5 utterances at inference time), and the self-reconstructed utterance:

[Screenshot: source, converted, target, and self-reconstructed mel-spectrograms]
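
The target-embedding averaging at inference time is just the following (a sketch; speaker_encoder stands for the GE2E model and target_mels for a handful of mel-spectrograms from the target speaker, and the re-normalization to unit length is my own choice):

import torch

with torch.no_grad():
    # One embedding per target utterance, then average and re-normalize to unit length
    embs = [speaker_encoder(mel.unsqueeze(0)) for mel in target_mels]
    target_emb = torch.stack(embs).mean(dim=0)
    target_emb = target_emb / target_emb.norm()
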
auspicious3000 commented 4 years ago

The complete reproducible training code has been released in this repo.

LG-SS commented 4 years ago

[quoting @himajin2045's comment above in full]

So can you upload some samples publicly so we can judge the quality of the converted audio on the Chinese dataset? Thanks

himajin2045 commented 4 years ago

@LG-SS Try it yourself: https://apps.apple.com/cn/app/%E9%AD%94%E6%9C%AF%E5%8F%98%E5%A3%B0%E5%99%A8/id1499265894

auspicious3000 commented 4 years ago

@ye2020 May I ask which vocoder you used to convert the spectrograms to waveforms? Thx!

himajin2045 commented 4 years ago

@auspicious3000 It's MelGAN: https://github.com/descriptinc/melgan-neurips/issues/4#issuecomment-570106464

shoutong commented 4 years ago

[quoting @himajin2045's comment above in full]

Would you please publicly release your code and model? I've tried many times with different corpora, but all results are just unintelligible noise.

Yishun99 commented 3 years ago

Hi @himajin2045, could you please explain why you chose L1 loss instead of the L2 loss used in the original design?

himajin2045 commented 3 years ago

@DouYishun It's been a year and I don't quite remember the exact reason, but I'm sure it was not a theoretical one.

himajin2045 commented 3 years ago

@shoutong Sorry for the late reply. For those who are interested in my code, here you go: https://github.com/himajin2045/voice-conversion

JohnHerry commented 3 years ago

@himajin2045 Thanks for your great work. I am training a Mandarin VC model now, using the AISHELL-3 dataset, which contains more than 70 hours of speech from 174 different speakers. The speaker embedding model I am using is the pretrained one from this project, and I did not change any parameters in my experiment.

But my VC model produces only noise; no intelligible speech is generated. My guesses about the possible reasons are:

1. The speaker embedding model should be retrained on a Mandarin corpus (I did not find a pretrained Mandarin one).

2. Maybe the training corpus is too small and more samples are needed.

3. Some hyperparameters should be changed before my preprocessing and training.

4. make_spect.py may need some changes to produce mel-spectrograms in a more commonly used format.

But I have no clear sense of which of these I should address.

Do you have any suggestions? Thanks

himajin2045 commented 3 years ago

@JohnHerry

Sorry, I didn't try the updated pretrained models published in this repo. But as you can see in my previous comments, 120 speakers was enough in my experience.

Perhaps you can try my code to train your own speaker encoder and vocoder.

JohnHerry commented 3 years ago

[quoting @himajin2045's reply above]

Thanks for your help. I have noticed your VC project and will give it a try.

auspicious3000 commented 3 years ago

@taubaaron Please disregard that message. The code has already been released.