NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

Question about Tacotron2 Speech Synthesis #1012

Open jkw1jkw1 opened 2 years ago

jkw1jkw1 commented 2 years ago

Hi,

I've been looking at the PyTorch/SpeechSynthesis/Tacotron2 model in this repo. I first trained on one of the LJSpeech-1.1 subsets to verify that my setup could train the models; after only 100 epochs it produced results of the expected quality.

I then switched to my own dataset, which contains about 4,500 samples of 2-15 seconds of speech each. After 250 epochs, however, inference produces only white noise or static. I've checked the dataset to make sure the files are being read correctly and the audio encoding is right, and everything looks fine as far as I can tell.
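For completeness, this is roughly the check I ran over the file list (a sketch; I'm assuming my filelist follows the LJSpeech-style `path|text` convention and the default 22050 Hz sample rate, so adjust the paths and thresholds for your setup):

```python
from scipy.io import wavfile

# Sketch: verify every clip in the filelist loads, has the expected
# sample rate (22050 Hz in the default config), and is 2-15 s long.
# The filelist path below is a placeholder for my own file.
with open('filelists/my_train_filelist.txt') as f:
    for line in f:
        path = line.split('|')[0]
        rate, data = wavfile.read(path)
        duration = len(data) / rate
        if rate != 22050 or not (2.0 <= duration <= 15.0):
            print(path, rate, round(duration, 2))
```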

I was wondering whether anyone has run into this kind of problem before, or could suggest how I might go about evaluating the Tacotron2 and WaveGlow models.
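To make the question concrete, the kind of check I have in mind looks something like this, based on the repo's torch.hub entry points (I'm writing the entry-point names and the `infer()` return signature from memory, so treat it as a sketch; for my own run I'd load my checkpoint's state_dict into the model instead of the pretrained weights):

```python
import torch
import matplotlib.pyplot as plt

# Sketch: load Tacotron2/WaveGlow via this repo's torch.hub entry points
# and inspect the attention alignment, since a diffuse (non-diagonal)
# alignment would explain noise-only audio.
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow')
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
tacotron2 = tacotron2.cuda().eval()
waveglow = waveglow.cuda().eval()

sequences, lengths = utils.prepare_input_sequence(["Hello world, this is a test."])

with torch.no_grad():
    # As far as I recall, infer() returns the mel spectrogram, its
    # lengths, and the attention alignments.
    mel, _, alignments = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)

# A healthy model shows a roughly diagonal text-to-frame alignment.
plt.imshow(alignments[0].cpu().numpy().T, aspect='auto', origin='lower')
plt.xlabel('decoder step')
plt.ylabel('encoder step')
plt.savefig('alignment.png')
```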

One thing that does confuse me is that, even on the LJSpeech dataset, the training loss of the WaveGlow model appears to be negative. Is this to be expected?
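For reference, this is my paraphrase of what I think the loss in waveglow/loss_function.py computes (names from memory, so treat it as a sketch): the Jacobian log-determinant terms are subtracted from a Gaussian log-prob on the latent, which is presumably what lets the value drop below zero, but I'd appreciate confirmation.

```python
import torch

# Sketch of the WaveGlow objective as I read it: a continuous negative
# log-likelihood, which is not bounded below at zero. Since log_s and
# log_det_W are subtracted, the per-element value can legitimately go
# negative as the flow fits the data better.
def waveglow_loss(z, log_s_list, log_det_w_list, sigma=1.0):
    loss = torch.sum(z * z) / (2 * sigma * sigma)   # -log N(z; 0, sigma), up to a constant
    for log_s in log_s_list:
        loss = loss - torch.sum(log_s)              # affine-coupling log|det|
    for log_det_w in log_det_w_list:
        loss = loss - torch.sum(log_det_w)          # invertible 1x1 conv log|det|
    return loss / z.numel()                         # mean over elements
```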

Many thanks

jkw1jkw1 commented 2 years ago

After some more investigation I can see that the validation loss of the Tacotron2 model doesn't seem to improve after roughly 3-5 epochs.
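For what it's worth, I'm reading the numbers out of the DLLogger JSON stream that train.py writes; the file name (`nvlog.json`), the `DLLL` line prefix, and the `val_loss` key are just what my setup produced, so treat them as assumptions and adjust for yours:

```python
import json

# Sketch: collect validation loss per epoch from a DLLogger JSON stream.
# Assumptions: one JSON object per line, possibly prefixed with "DLLL",
# with a "val_loss" key inside the "data" field.
val_losses = []
with open('nvlog.json') as f:
    for line in f:
        line = line.strip()
        if line.startswith('DLLL'):
            line = line[len('DLLL'):].strip()
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        data = entry.get('data', {})
        if 'val_loss' in data:
            val_losses.append(data['val_loss'])

# In my runs the curve flattens after roughly epoch 3-5.
print(val_losses)
```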

Does anyone have any suggestions?