Closed: jeffxtang closed this issue 4 years ago
See https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/tacotron2/model.py#L129: dropout in the prenet is applied regardless of eval().
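A minimal sketch of the behavior being described (not the exact repo code): if the prenet calls `F.dropout` with `training=True` hard-coded, then `model.eval()` has no effect on it and inference stays stochastic.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Prenet(nn.Module):
    """Illustrative prenet: dropout is forced on, even in eval mode."""
    def __init__(self, in_dim=80, sizes=(256, 256)):
        super().__init__()
        dims = [in_dim] + list(sizes)
        self.layers = nn.ModuleList(
            [nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])]
        )

    def forward(self, x):
        for linear in self.layers:
            # training=True ignores self.training, so eval() does not disable it
            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
        return x

prenet = Prenet().eval()
x = torch.randn(1, 80)
with torch.no_grad():
    # almost certainly prints False: two forward passes use different dropout masks
    print(torch.allclose(prenet(x), prenet(x)))
```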
Hi @jeffxtang, as @CookiePPP noted, the prenet in the Tacotron 2 model has dropout enabled during inference. More than that, WaveGlow draws samples from a random distribution to generate the audio in the reverse flow (see figure). Both features contribute to the varying output in every run.
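A rough sketch of that second source of randomness (illustrative names, not the repo's exact API): WaveGlow inference starts from Gaussian noise z ~ N(0, sigma^2) and pushes it through the inverse flow conditioned on the mel spectrogram, so every call draws fresh noise.

```python
import torch

def waveglow_infer_sketch(inverse_flow, mel, n_audio_samples, sigma=0.6):
    # new noise on every call -> run-to-run variation even with a fixed mel input
    z = sigma * torch.randn(mel.size(0), n_audio_samples, device=mel.device)
    # inverse_flow stands in for the trained WaveGlow model's reverse pass
    return inverse_flow(z, mel)
```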
If the synthesized speech is sometimes unrecognizable, try training Tacotron 2 for more epochs (e.g. 1500).
@CookiePPP @GrzegorzKarchNV Thanks! If Tacotron 2 is well trained, then even though WaveGlow draws samples from a random distribution, the difference between TTS results for the same text across runs should be barely noticeable, right?
You can expect slightly different intonation and speech length in each run; I would say it is noticeable.
@jeffxtang could you send a few generated samples?
@GrzegorzKarchNV Yes, there is some noticeable difference in intonation, as shown in the attachment.
@jeffxtang Does this issue persist after training Tacotron for a larger number of epochs (1500), as suggested above?
I trained a Tacotron 2 model for 1200 epochs (about 24 hours on a single GV100 GPU) and a WaveGlow model for 800 epochs (about 60 hours) on my own dataset. Running inference.py (via scripts/inference.sh with the two checkpoints) then generates different results every time; sometimes the difference is small, but other times it is large, even to the point of the speech being unrecognizable.
The code already calls eval() to disable dropout during inference, so I don't see where the randomness comes from. My dataset (about 3000 wavs in the training set) is about 2.x hours of audio. I also tried the Tacotron 2 and WaveGlow checkpoints trained on the LJSpeech dataset and found that running inference generates somewhat different results every time as well.
Why is this? How can I make the TTS result remain the same?
Thanks!
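One way to make repeated runs comparable is not stated in the thread, so this is only a suggestion under the assumption that all of the randomness (prenet dropout masks, WaveGlow noise) goes through PyTorch's RNG: fix the random seeds before each synthesis.

```python
import random
import numpy as np
import torch

def seed_everything(seed=1234):
    # fix the Python, NumPy and PyTorch RNGs so the dropout masks and the
    # WaveGlow latent noise are drawn identically on every run
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(1234)
# then run the same inference pipeline (e.g. inference.py with the two
# checkpoints); with identical seeds the outputs should match across runs
```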