NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

about Tacotron2 Speech Synthesis #1273

Open naveens01 opened 1 year ago

naveens01 commented 1 year ago

Hi, I've been looking at the PyTorch/SpeechSynthesis/Tacotron2 model in this repo. I first trained on a subset of LJSpeech-1.1 to test whether my setup worked. I only ran it for 100 epochs, and it produced results of the expected quality.

I then switched to my own data set, which contains about 4500 samples of between 2 and 15 seconds of speech. After 250 epochs, however, inference produces only white noise or static. I've checked my data set to make sure it was being read correctly and that the encoding was OK, and everything seems fine as far as I can see.
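For anyone hitting the same wall: a frequent cause of white-noise output with a custom data set is audio that does not match the sampling rate or channel layout the configs assume (the repo's defaults target LJSpeech-style 22050 Hz mono WAVs). Below is a minimal sanity-check sketch for an LJSpeech-style filelist ("path|text" per line); `check_filelist` and its thresholds are hypothetical names, not part of the repo.

```python
import os
import wave

def check_filelist(filelist_path, expected_rate=22050, min_s=2.0, max_s=15.0):
    """Hypothetical validator: flag entries whose audio is likely to break training.

    Assumes an LJSpeech-style filelist where each line is "wav_path|transcript".
    Returns a list of (path, problem) tuples.
    """
    problems = []
    with open(filelist_path, encoding="utf-8") as f:
        for line in f:
            path = line.strip().split("|", 1)[0]
            if not os.path.isfile(path):
                problems.append((path, "missing file"))
                continue
            with wave.open(path, "rb") as w:
                rate = w.getframerate()
                duration = w.getnframes() / rate
                if rate != expected_rate:
                    problems.append((path, f"sample rate {rate} != {expected_rate}"))
                if w.getnchannels() != 1:
                    problems.append((path, "not mono"))
                if not (min_s <= duration <= max_s):
                    problems.append((path, f"duration {duration:.2f}s out of range"))
    return problems
```

An empty result doesn't guarantee the data is good, but any flagged entry is worth fixing (e.g. resampling with sox or librosa) before blaming the model.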

I was wondering whether anyone has run into a similar problem before, or could suggest how I should go about evaluating the Tacotron2 and WaveGlow models.

One thing that does confuse me: even for the LJSpeech dataset, the training loss of the WaveGlow model is negative. Is this to be expected?
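As context for the negative-loss question: WaveGlow is a flow-based model trained by minimizing a negative log-likelihood under a continuous density, and a probability *density* can exceed 1, so its negative log can legitimately be negative. A tiny sketch (a plain Gaussian stand-in, not WaveGlow's actual objective) illustrates this:

```python
import math

def gaussian_nll(x, mu=0.0, sigma=0.1):
    """Negative log-likelihood of x under a Gaussian density N(mu, sigma^2).

    With a small sigma the density at the mean is 1/(sigma*sqrt(2*pi)) > 1,
    so the NLL is negative -- the same reason flow models like WaveGlow can
    report negative training losses while still improving.
    """
    return 0.5 * ((x - mu) / sigma) ** 2 + math.log(sigma) + 0.5 * math.log(2 * math.pi)

print(gaussian_nll(0.0))  # ≈ -1.38, i.e. negative
```

So a negative WaveGlow loss on its own is not a symptom of a broken run; the trend of the loss matters more than its sign.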

Many thanks

CentralMatthew commented 8 months ago

@naveens01 Hi, did you find a solution? I have a similar case: when I train on my own custom data set, inference produces only white noise.