NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

Speech Synthesis Inference: how come different runs output different results? #387

Closed — jeffxtang closed this issue 4 years ago

jeffxtang commented 4 years ago

I trained a Tacotron 2 model for 1200 epochs (about 24 hours on a single GV100 GPU) and a WaveGlow model for 800 epochs (about 60 hours) on my own dataset. Running inference.py (via scripts/inference.sh with the two checkpoints) generates different results every time: sometimes the difference is small, but other times it is large, even to the point of the speech being unrecognizable.

The code already calls eval() to disable dropout during inference, so I don't see where the randomness comes from. My dataset (about 3000 wavs in the training set) is about 2.x hours of audio. I also tried Tacotron 2 and WaveGlow checkpoints trained on the LJSpeech dataset and found that inference produces somewhat different results on every run there too.

Why is this? How can I make the TTS result remain the same?

Thanks!

CookiePPP commented 4 years ago

https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/tacotron2/model.py#L129 — dropout in the prenet is applied regardless of eval().
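To make the mechanism concrete, here is a minimal sketch of this pattern (not a verbatim copy of model.py): the prenet calls F.dropout with training=True hard-coded, so model.eval() never switches it off. Keeping prenet dropout active at inference follows the original Tacotron 2 paper, which uses it to introduce output variation in the autoregressive decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Prenet(nn.Module):
    """Sketch of a Tacotron 2-style prenet whose dropout stays active at inference."""
    def __init__(self, in_dim=80, sizes=(256, 256)):
        super().__init__()
        dims = [in_dim] + list(sizes)
        self.layers = nn.ModuleList(
            [nn.Linear(d_in, d_out, bias=False)
             for d_in, d_out in zip(dims[:-1], dims[1:])]
        )

    def forward(self, x):
        for linear in self.layers:
            # training=True is hard-coded, so a fresh dropout mask is sampled
            # even after model.eval(), injecting randomness at every decoder step.
            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
        return x

prenet = Prenet().eval()
x = torch.randn(1, 80)
print(prenet(x).sum(), prenet(x).sum())  # differs between calls despite eval()
```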

ghost commented 4 years ago

Hi @jeffxtang, as @CookiePPP noted, the prenet in the Tacotron 2 model has dropout enabled during inference. More than that, WaveGlow samples from a random distribution to generate the audio in the reverse flow (see figure). Both contribute to the output varying on every run.
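To make runs repeatable, one option is to seed PyTorch's RNGs before synthesis, which fixes both the prenet dropout masks and the Gaussian noise that WaveGlow's reverse flow starts from. A minimal sketch follows; the infer() signatures and return values mirror the repo's inference.py but are assumptions, so adjust them to the checked-out version:

```python
import torch

def synthesize(tacotron2, waveglow, text_padded, input_lengths, sigma=0.666, seed=None):
    """Hypothetical wrapper: with a fixed seed, the dropout masks and
    WaveGlow's latent noise z ~ N(0, sigma^2) are drawn identically on
    every call, so repeated runs produce the same waveform."""
    if seed is not None:
        torch.manual_seed(seed)           # seeds the CPU generator
        torch.cuda.manual_seed_all(seed)  # seeds the CUDA generators (dropout, randn on GPU)
    with torch.no_grad():
        # Assumed signatures -- check inference.py for the exact return values.
        _, mel, _, _ = tacotron2.infer(text_padded, input_lengths)
        audio = waveglow.infer(mel, sigma=sigma)
    return audio
```

Without a fixed seed, the noise and dropout masks change on every call, which is exactly the run-to-run variation described above.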

If the synthesized speech is sometimes unrecognizable, try training Tacotron 2 for more epochs (e.g., 1500).

jeffxtang commented 4 years ago

@CookiePPP @GrzegorzKarchNV Thanks! If Tacotron 2 is well trained, then even though "WaveGlow samples from a random distribution", the difference between TTS results for the same text across runs should be barely noticeable, right?

ghost commented 4 years ago

You can expect slightly different intonation and speech length in each run; I would say it is noticeable.

ghost commented 4 years ago

@jeffxtang could you send a few generated samples?

jeffxtang commented 4 years ago

@GrzegorzKarchNV Yes, there is some noticeable difference in the intonation, as shown in the attachment.

samples.zip

prashantskit commented 2 years ago

@jeffxtang Does this issue persist after training Tacotron 2 for a larger number of epochs (1500), as suggested above?