Closed: kvnsq closed this issue 4 years ago.
@dodobyte Someone reproduced the results and published them at ICLR. Unfortunately, those who successfully reproduce the results tend not to post here.
@kvnsq Is your dataloader randomized?
No, it wasn't. Thanks for the tip, I'll try it out. I had divided the mel-specs into fixed 128-frame slices, shuffled them, and iterated over them. Following your suggestion, I now iterate over the shuffled mel-specs and randomly crop them on the fly. Do you also randomly sample mel-specs from the training set at each training step, so that the same samples could appear in two consecutive mini-batches?

In my earlier runs I noticed that I could increase the batch size to 32 without the loss changing noticeably compared to a batch size of 2. Because training is much faster that way, I reran the experiments with on-the-fly random cropping, but didn't see an improvement in quality. I'm going to retrain the model with a batch size of 2 and an LR of 0.0001 overnight and report back if the quality changes.
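In case it helps others, here is a minimal sketch of what such on-the-fly cropping could look like. The class and variable names are made up for illustration (not code from this repo); `mel_specs` is assumed to be a list of `(num_frames, 80)` NumPy arrays and `speaker_ids` a list of matching speaker indices:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class RandomCropMelDataset(Dataset):
    def __init__(self, mel_specs, speaker_ids, crop_len=128):
        self.mel_specs = mel_specs
        self.speaker_ids = speaker_ids
        self.crop_len = crop_len

    def __len__(self):
        return len(self.mel_specs)

    def __getitem__(self, idx):
        mel = self.mel_specs[idx]
        # Pad utterances shorter than the crop length, then pick a random
        # 128-frame window -- the window is re-drawn every time the item is fetched.
        if mel.shape[0] < self.crop_len:
            mel = np.pad(mel, ((0, self.crop_len - mel.shape[0]), (0, 0)), mode="constant")
        start = np.random.randint(0, mel.shape[0] - self.crop_len + 1)
        crop = mel[start:start + self.crop_len]
        return torch.from_numpy(np.ascontiguousarray(crop)).float(), self.speaker_ids[idx]

# shuffle=True re-randomizes the utterance order every epoch; for sampling with
# replacement (so the same utterance can appear in consecutive batches), a
# torch.utils.data.RandomSampler(dataset, replacement=True) could be passed instead.
train_set_loader = DataLoader(RandomCropMelDataset(mel_specs, speaker_ids),
                              batch_size=2, shuffle=True, drop_last=True)
```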
Hi, first of all thanks for the great work on the AutoVC system. I have tried to replicate the system, but could not achieve nearly the same quality as the pre-trained model. I use the same pre-processing for the mel-spectrograms as discussed in issue #4 and have trained the system with the same 20 VCTK speakers as the experiment in the paper (additionally with 8 speakers from the VCC data set; the results were similar when those were omitted). I also used one-hot encodings instead of speaker embeddings from an encoder.

I trained for about 300,000 steps using Adam with default parameters and an LR of 0.0001; the training loss is about 6.67e-3 and the validation loss is about 0.01 and rising. I've also tried other learning rates (0.001, 0.0005) with no improvement in quality. The converted mel-spectrograms are still blurry and produce a low-quality, robotic voice. In comparison, the converted mel-spectrograms of the supplied AutoVC model are much sharper and produce a more natural voice, even when used with Griffin-Lim. Here are the mel-spectrograms of my retrained model and of the model from the repo:

[Images: retrained model vs. supplied model]

Here is a minimal example of the loss and training loop I use. I can also provide more of my code if wanted.
```python
import torch
import torch.nn.functional as F


def train_step(mel_spec_batch, embeddings_batch, generator, optimizer,
               weight_mu_zero_rec: float, weight_lambda_content: float):
    optimizer.zero_grad()
    # (batch_size=2, 1, num_frames=128, num_mels=80)
    mel_spec_batch_exp = mel_spec_batch.unsqueeze(1)
    mel_outputs, mel_outputs_postnet, content_codes_mel_input = generator(
        mel_spec_batch, embeddings_batch, embeddings_batch)
    # Returns content codes with self.encoder without using the decoder and postnet a second time
    content_codes_gen_output = generator.get_content_codes(mel_outputs_postnet, embeddings_batch)

    rec_loss = F.mse_loss(input=mel_outputs_postnet, target=mel_spec_batch_exp, reduction="mean")
    rec_0_loss = F.mse_loss(input=mel_outputs, target=mel_spec_batch_exp, reduction="mean")
    content_loss = F.l1_loss(input=content_codes_gen_output, target=content_codes_mel_input, reduction="mean")

    total_loss = rec_loss + weight_mu_zero_rec * rec_0_loss + weight_lambda_content * content_loss
    total_loss.backward()
    optimizer.step()


# Train loop...
for epoch in range(start_epoch + 1, args[FLAGS.MAX_NUM_EPOCHS] + 1):
    generator.train()
    # Iterate over mel-spec slices and the indices of their speakers
    for step_idx, (mel_spec_batch, speaker_idx_batch) in enumerate(train_set_loader):
        # Load the speaker embeddings of the speakers of the mel-spectrograms
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)].to(device)
        train_step(mel_spec_batch.to(device), spkr_embeddings, generator, optim,
                   weight_mu_zero_rec=args[FLAGS.AUTO_VC_MU_REC_LOSS_BEFORE_POSTNET],  # == 1.0
                   weight_lambda_content=args[FLAGS.AUTO_VC_LAMBDA_CONTENT_LOSS])      # == 1.0
    # The rest is computing the validation loss, resynthesizing utterances, saving the model every n epochs, etc.
```
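For the Griffin-Lim listening checks mentioned above, a rough inversion sketch like the following can be used (this is just an illustration, not part of the training code; the -16 dB reference, -100 dB floor, and [0, 1] normalization constants are assumptions that must mirror whatever the mel preprocessing actually uses):

```python
import numpy as np
import librosa

def mel_to_audio_griffin_lim(mel_01, sr=16000, n_fft=1024, hop=256, fmin=90, fmax=7600):
    # Undo the [0, 1] squashing and dB compression used during preprocessing (assumed constants)
    mel_db = mel_01.T * 100 - 100 + 16
    mel = np.power(10.0, mel_db / 20)
    # Pseudo-invert the mel filterbank, then run Griffin-Lim on the linear magnitudes
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=mel.shape[0], fmin=fmin, fmax=fmax)
    linear = np.maximum(1e-10, np.linalg.pinv(mel_basis) @ mel)
    return librosa.griffinlim(linear, hop_length=hop, n_iter=60)
```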
Does anyone have an idea what is wrong with my re-implementation or could anyone reimplement the system with good quality?
Thanks a lot in advance.
Hi, I didn't find the code for the style encoder. Did you implement the style encoder?
Please refer to #24, or simply use one-hot embeddings.
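For anyone taking the one-hot route, a minimal sketch (an assumption for illustration, not code from this repo; it assumes the repo's `Generator(dim_neck, dim_emb, dim_pre, freq)` constructor and a 20-speaker setup):

```python
import torch

num_speakers = 20                                # e.g. the 20 VCTK training speakers
speaker_embedding_mat = torch.eye(num_speakers)  # row i is the one-hot vector for speaker i

# The generator must be built with an embedding dimension matching these vectors,
# e.g. Generator(32, num_speakers, 512, 32); alternatively, zero-pad the one-hot
# rows up to the default dim_emb of 256:
# speaker_embedding_mat = torch.nn.functional.pad(torch.eye(num_speakers), (0, 256 - num_speakers))

# Lookup during training, given a batch of speaker indices:
speaker_idx_batch = torch.tensor([3, 17])
spkr_embeddings = speaker_embedding_mat[speaker_idx_batch]  # shape: (batch, num_speakers)
```

Note that with one-hot embeddings the model can only convert between speakers seen during training; zero-shot conversion to unseen speakers still requires a trained speaker encoder.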
@kvnsq Did you manage to improve the training of your model? I have the same problem: the retrained model produces poor-quality mel-spectrograms. I've tried 20/40 VCTK speakers, batch sizes from 2 to 32, and various mel-spectrogram extraction frameworks.
@zonerby No, I gave up on this model and am currently working on other VC models. I might try to retrain the AutoVC model with other NN architectures in the future, but that has lower priority for me right now. If I do retrain it and see some improvements, I'll let you know.
I finally got this model to work on a Chinese corpus and got quite good output quality, even when converting from one unseen speaker to another unseen speaker. I used 120 speakers from the corpus with 120 utterances per speaker, a learning rate of 1e-4, and a batch size of 4. The content embedding is downsampled by a factor of 16 (i.e. `Generator(32, 256, 512, 16)`); the default factor of 32 did not work. I use a GE2E speaker encoder trained on a dataset combining several Chinese corpora with ~2800 speakers in total, and I use `l1_loss` instead of `mse_loss` for all three losses. The model was trained for 370k steps and reaches a training loss of ~0.045.
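For reference, a sketch of that loss change (an illustration of the swap, not the released code), reusing the tensor names from the training snippet earlier in this thread:

```python
import torch.nn.functional as F

def autovc_l1_losses(mel_outputs, mel_outputs_postnet, mel_spec_batch_exp,
                     content_codes_gen_output, content_codes_mel_input,
                     weight_mu_zero_rec=1.0, weight_lambda_content=1.0):
    # L1 instead of MSE for all three terms
    rec_loss = F.l1_loss(mel_outputs_postnet, mel_spec_batch_exp)
    rec_0_loss = F.l1_loss(mel_outputs, mel_spec_batch_exp)
    content_loss = F.l1_loss(content_codes_gen_output, content_codes_mel_input)
    return rec_loss + weight_mu_zero_rec * rec_0_loss + weight_lambda_content * content_loss
```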
This is how one of the training samples looks; I also plot the downsampled content embedding along with the upsampled one:
This is a plot of a conversion from an unseen male to an unseen female speaker. I plot the source utterance, the converted utterance, one of the target speaker's utterances (the target speaker embedding is averaged over 5 utterances at inference time), and the self-reconstructed utterance:
The complete reproducible training code has been released in this repo.
So can you upload some samples publicly so we can judge the quality of the converted audio on the Chinese dataset? Thanks
@LG-SS Try it yourself: https://apps.apple.com/cn/app/%E9%AD%94%E6%9C%AF%E5%8F%98%E5%A3%B0%E5%99%A8/id1499265894
@ye2020 May I ask what vocoder you used to convert the spectrograms to waveforms? Thx!
@auspicious3000 It's MelGAN: https://github.com/descriptinc/melgan-neurips/issues/4#issuecomment-570106464
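For anyone wanting to try the same vocoder: something like the following should work if the MelGAN repo's torch.hub entry point (`load_melgan`) and `inverse()` interface are as I remember them; treat the exact names as assumptions and check the MelGAN README. Whatever vocoder you use, the mel parameters fed to AutoVC must match the ones the vocoder was trained on.

```python
# Rough sketch -- API names are assumptions, verify against the MelGAN README.
import torch

vocoder = torch.hub.load('descriptinc/melgan-neurips', 'load_melgan')

# mel: (1, 80, num_frames) tensor, extracted with the same parameters MelGAN was trained on
with torch.no_grad():
    audio = vocoder.inverse(mel).squeeze().cpu().numpy()  # mel-spectrogram -> waveform
```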
Would you please publicly release your code and model? I've tried many times with different corpora, but all the results are just noise.
Hi @himajin2045, could you please explain why you chose the L1 loss instead of the L2 loss from the original design?
@DouYishun It's been a year and I don't quite remember the exact reason, but I'm sure it was not a theoretical one.
@shoutong Sorry for the late reply, and for those who are interested in my code, here you go https://github.com/himajin2045/voice-conversion
@himajin2045 Thanks for your good work. I am training a Mandarin VC model now, using the AISHELL-3 dataset, which has more than 70 hours of speech from 174 different speakers. The speaker embedding model I am using is the pretrained one from this project, and I did not change any parameters in my experiment.

But my VC model produces only noise; no speech waveform is generated. My guesses about the reasons are as follows:

1. The speaker embedding model should be retrained on a Mandarin corpus (I did not find a pretrained one for Mandarin).
2. Maybe the training corpus is too small and more samples are needed.
3. Some hparams should have been changed before my training preprocessing.
4. `make_spect.py` may need changes to produce the kind of mel-spectrograms more commonly used in training (a rough extraction sketch follows below).

But I have no clear sense of which of these to address. Do you have any suggestions? Thanks
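For point 4, here is a rough sketch of the kind of extraction `make_spect.py` does. The filter cutoff, dB reference/floor, and normalization values are recalled from the repo and issue #4 and should be treated as assumptions; the key point is that the speaker encoder, AutoVC, and the vocoder must all see mels computed with identical parameters:

```python
import numpy as np
import librosa
from scipy.signal import butter, filtfilt

def wav_to_mel(path, sr=16000, n_fft=1024, hop=256, n_mels=80, fmin=90, fmax=7600):
    wav, _ = librosa.load(path, sr=sr)
    # High-pass to remove DC / rumble (make_spect.py uses a ~30 Hz Butterworth filter)
    b, a = butter(5, 30 / (sr / 2), btype="high")
    wav = filtfilt(b, a, wav)
    spec = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop))
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)
    mel = mel_basis @ spec
    # Log-compress and squash to [0, 1] with a -100 dB floor (assumed to match the repo)
    mel_db = 20 * np.log10(np.maximum(1e-5, mel)) - 16
    return np.clip((mel_db + 100) / 100, 0, 1).T  # (num_frames, n_mels)
```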
@JohnHerry
Sorry, I didn't try the updated pretrained models published in this repo. But as you can see from my previous comments, 120 speakers were enough in my experience.
Perhaps you can try my code to train your own speaker encoder and vocoder.
Thanks for your help. I have noticed your VC project, and I will give it a try.
@taubaaron Please disregard that message. The code has already been released.