Unable to reproduce decent quality generated audio with training data samples

Hi everyone,

Thanks a lot for releasing the code for the Mellotron model, and for the amazing research work!

I was trying to reuse the checkpoints y'all posted to perform voice transfer using a sample from the training data (LibriTTS, the same train-clean-100 subset used to train the model). I'm using the inference Colab, but running it on an audio clip from the training data: 40_121026_000224_000000.wav, with the text "So sudden and violent was the fit that the unfortunate prisoner was unable to complete the sentence; a violent convulsion shook his whole frame, his eyes started from their sockets, his mouth was drawn on one side, his cheeks became purple, he struggled, foamed, dashed himself about, and uttered the most dreadful cries, which, however, Dantes prevented from being heard by covering".

LibriTTS speaker ID 40 corresponds to ID 26 in Mellotron. So, in the input data to the model, I specified the Mellotron ID as 26 and tried to transfer the voice to a random target speaker. However, the output quality is not as good as the samples on the website. Am I missing something?
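For context, here is a minimal sketch of how I understand this kind of ID mapping to work. It assumes the model index is just the position of each dataset speaker ID in the sorted list of unique IDs; I have not confirmed that this is exactly what the repo does. `build_speaker_lookup` is a hypothetical helper and the IDs are toy values, not the real train-clean-100 speaker list.

```python
def build_speaker_lookup(dataset_speaker_ids):
    """Map each distinct dataset speaker ID to a 0-based model index.

    Hypothetical helper: assumes indices follow the sorted order of unique IDs.
    """
    unique_sorted = sorted(set(dataset_speaker_ids))
    return {speaker_id: index for index, speaker_id in enumerate(unique_sorted)}


# Toy IDs for illustration only, not the actual LibriTTS speaker list.
toy_ids = [40, 19, 26, 40, 27]
lookup = build_speaker_lookup(toy_ids)
print(lookup)
```

If the real lookup is built differently, the index passed to the model could point at a different speaker than intended, so it seems worth double-checking.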

Here is the colab: https://colab.research.google.com/drive/1e0GCP0fAFoXLMY7S_CUnJME9e4OCzOyy

I also found that if my text differs slightly from the audio content (e.g., the speaker says 'stared' but the corresponding text has 'started'), the output audio is okay up to 'stared' but gets significantly worse after the mismatched word. Is this because the decoder is auto-regressive? Is there a way to fix this issue?
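One way to catch such mismatches before running inference is to diff the input text against a transcript of what is actually spoken. This is only a rough sketch using Python's standard `difflib`; it assumes a reliable transcript of the audio is available (typed by hand or from a speech recognizer), and `tokenize` and `find_word_mismatches` are hypothetical helper names, not part of Mellotron.

```python
import difflib
import re


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def find_word_mismatches(spoken_text, input_text):
    """Return (operation, spoken_words, input_words) for every differing span."""
    spoken = tokenize(spoken_text)
    written = tokenize(input_text)
    matcher = difflib.SequenceMatcher(a=spoken, b=written, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append((tag, spoken[i1:i2], written[j1:j2]))
    return mismatches


# Assumed transcript of the audio vs. the text fed to the model.
spoken = "his eyes stared from their sockets"
written = "his eyes started from their sockets"
mismatches = find_word_mismatches(spoken, written)
print(mismatches)
```

This only flags where the text and the audio disagree; it does not explain or fix the degradation after the mismatched word.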
