NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

Question about Tacotron2 Speech Synthesis #1012

Open jkw1jkw1 opened 2 years ago

jkw1jkw1 commented 2 years ago

Hi,

I've been looking at the PyTorch/SpeechSynthesis/Tacotron2 model in this repo. I first trained on one of the LJSpeech-1.1 subsets to verify that my setup could train the models; after only 100 epochs it produced results of the expected quality.

I then switched to my own dataset, which contains about 4,500 samples of 2-15 seconds of speech each. After 250 epochs, however, inference produces only white noise or static. I've checked the dataset to make sure the files are being read correctly and the audio encoding is right, and everything looks fine as far as I can tell.
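For completeness, this is roughly the check I ran over the file list (a sketch; I'm assuming my filelist follows the LJSpeech-style `path|text` convention and the default 22050 Hz sample rate, so adjust the paths and thresholds for your setup):

```python
from scipy.io import wavfile

# Sketch: verify every clip in the filelist loads, has the expected
# sample rate (22050 Hz in the default config), and is 2-15 s long.
# The filelist path below is a placeholder for my own file.
with open('filelists/my_train_filelist.txt') as f:
    for line in f:
        path = line.split('|')[0]
        rate, data = wavfile.read(path)
        duration = len(data) / rate
        if rate != 22050 or not (2.0 <= duration <= 15.0):
            print(path, rate, round(duration, 2))
```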

I was wondering whether anyone has run into this kind of problem before, or could suggest how I might go about evaluating the Tacotron2 and WaveGlow models.
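To make the question concrete, the kind of check I have in mind looks something like this, based on the repo's torch.hub entry points (I'm writing the entry-point names and the `infer()` return signature from memory, so treat it as a sketch; for my own run I'd load my checkpoint's state_dict into the model instead of the pretrained weights):

```python
import torch
import matplotlib.pyplot as plt

# Sketch: load Tacotron2/WaveGlow via this repo's torch.hub entry points
# and inspect the attention alignment, since a diffuse (non-diagonal)
# alignment would explain noise-only audio.
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow')
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
tacotron2 = tacotron2.cuda().eval()
waveglow = waveglow.cuda().eval()

sequences, lengths = utils.prepare_input_sequence(["Hello world, this is a test."])

with torch.no_grad():
    # As far as I recall, infer() returns the mel spectrogram, its
    # lengths, and the attention alignments.
    mel, _, alignments = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)

# A healthy model shows a roughly diagonal text-to-frame alignment.
plt.imshow(alignments[0].cpu().numpy().T, aspect='auto', origin='lower')
plt.xlabel('decoder step')
plt.ylabel('encoder step')
plt.savefig('alignment.png')
```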

One thing that does confuse me is that, even on the LJSpeech dataset, the training loss of the WaveGlow model appears to be negative. Is this to be expected?
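For reference, this is my paraphrase of what I think the loss in waveglow/loss_function.py computes (names from memory, so treat it as a sketch): the Jacobian log-determinant terms are subtracted from a Gaussian log-prob on the latent, which is presumably what lets the value drop below zero, but I'd appreciate confirmation.

```python
import torch

# Sketch of the WaveGlow objective as I read it: a continuous negative
# log-likelihood, which is not bounded below at zero. Since log_s and
# log_det_W are subtracted, the per-element value can legitimately go
# negative as the flow fits the data better.
def waveglow_loss(z, log_s_list, log_det_w_list, sigma=1.0):
    loss = torch.sum(z * z) / (2 * sigma * sigma)   # -log N(z; 0, sigma), up to a constant
    for log_s in log_s_list:
        loss = loss - torch.sum(log_s)              # affine-coupling log|det|
    for log_det_w in log_det_w_list:
        loss = loss - torch.sum(log_det_w)          # invertible 1x1 conv log|det|
    return loss / z.numel()                         # mean over elements
```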

Many thanks

jkw1jkw1 commented 2 years ago

After some more investigation I can see that the validation loss of the Tacotron2 model doesn't seem to improve after roughly 3-5 epochs.
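For what it's worth, I'm reading the numbers out of the DLLogger JSON stream that train.py writes; the file name (`nvlog.json`), the `DLLL` line prefix, and the `val_loss` key are just what my setup produced, so treat them as assumptions and adjust for yours:

```python
import json

# Sketch: collect validation loss per epoch from a DLLogger JSON stream.
# Assumptions: one JSON object per line, possibly prefixed with "DLLL",
# with a "val_loss" key inside the "data" field.
val_losses = []
with open('nvlog.json') as f:
    for line in f:
        line = line.strip()
        if line.startswith('DLLL'):
            line = line[len('DLLL'):].strip()
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        data = entry.get('data', {})
        if 'val_loss' in data:
            val_losses.append(data['val_loss'])

# In my runs the curve flattens after roughly epoch 3-5.
print(val_losses)
```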

Does anyone have any suggestions?