NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License
854 stars, 187 forks

Unable to reproduce decent quality generated audio with training data samples #41

Closed (rohanbadlani closed 4 years ago)

rohanbadlani commented 4 years ago

Hi everyone,

Thanks a lot for releasing the code for the Mellotron model and amazing research work!

I was trying to reuse the checkpoints y'all posted to perform voice transfer using a sample from the training data (LibriTTS, the same train-clean-100 subset used for training the model). Specifically, I'm using the inference colab, but running it on audio clips from the training data, in particular 40_121026_000224_000000.wav, with the text: "So sudden and violent was the fit that the unfortunate prisoner was unable to complete the sentence; a violent convulsion shook his whole frame, his eyes started from their sockets, his mouth was drawn on one side, his cheeks became purple, he struggled, foamed, dashed himself about, and uttered the most dreadful cries, which, however, Dantes prevented from being heard by covering".

  1. LibriTTS speaker ID 40 is present in Mellotron with ID 26. So, in the input data to the model, I specified the Mellotron ID as 26 and tried to transfer the voice to a random target speaker. However, the quality of the output is not as good as the samples on the website. Am I missing something?

Here is the colab: https://colab.research.google.com/drive/1e0GCP0fAFoXLMY7S_CUnJME9e4OCzOyy

  2. I also found that if my text is slightly different from the audio content (e.g., the audio says 'stared' but the corresponding text has 'started'), then the output audio is okay until 'stared' but gets significantly worse after the mismatched word. Is this because the decoder is auto-regressive? Is there a way to fix this issue?

rafaelvalle commented 4 years ago

  1. Check the quality of the rhythm (alignment map) and the pitch (F0 contour).
  2. The rhythm probably does not align properly.
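
One rough way to act on the two suggestions above is to summarize the alignment map and the F0 contour with a few numbers before listening to the output. The sketch below is only an illustration: the function names `alignment_stats` and `f0_stats` are hypothetical and not part of the Mellotron code, the alignment is assumed to be a NumPy array of shape (decoder frames, input tokens), the F0 contour is assumed to be in Hz with 0 marking unvoiced frames, and the `jump_ratio` threshold is an arbitrary choice. How these arrays are pulled out of the colab is not shown here.

```python
import numpy as np


def alignment_stats(alignment):
    """Summarize an attention map of shape (decoder_frames, input_tokens)."""
    alignment = np.asarray(alignment, dtype=float)
    peaks = alignment.argmax(axis=1)   # most-attended input token per frame
    steps = np.diff(peaks)             # movement of the peak between frames
    return {
        # Fraction of frame-to-frame moves that do not go backwards.
        "monotonic_fraction": float(np.mean(steps >= 0)) if steps.size else 1.0,
        # Average weight on the most-attended token (1.0 means fully sharp).
        "mean_peak_weight": float(alignment.max(axis=1).mean()),
        # Fraction of input tokens that are the peak for at least one frame.
        "token_coverage": float(np.unique(peaks).size / alignment.shape[1]),
    }


def f0_stats(f0, jump_ratio=1.8):
    """Summarize an F0 contour in Hz, where 0 marks unvoiced frames."""
    f0 = np.asarray(f0, dtype=float)
    # Only compare neighbouring frames that are both voiced.
    both_voiced = (f0[:-1] > 0) & (f0[1:] > 0)
    ratio = f0[1:][both_voiced] / f0[:-1][both_voiced]
    jumps = np.sum((ratio > jump_ratio) | (ratio < 1.0 / jump_ratio))
    return {
        "voiced_fraction": float(np.mean(f0 > 0)) if f0.size else 0.0,
        # Sudden near-octave changes often indicate pitch-tracking errors.
        "octave_like_jumps": int(jumps),
    }


if __name__ == "__main__":
    # Synthetic stand-ins, not real model output.
    diagonal = np.repeat(np.eye(5), 2, axis=0)   # clean, monotonic alignment
    wandering = np.eye(5)[[0, 1, 2, 1, 0]]       # peak moves backwards
    print(alignment_stats(diagonal))
    print(alignment_stats(wandering))
    print(f0_stats([0, 100, 102, 210, 0, 105]))
```

A low monotonic fraction, low token coverage, or a diffuse peak weight would point at the rhythm not aligning with the text, and a large number of octave-like jumps would point at a noisy pitch track. These are heuristics only; plotting the alignment map and the F0 contour is still the more reliable check.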