NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

max decoder steps #575

Open Raha304 opened 2 years ago

Raha304 commented 2 years ago

Hi guys,

I'm trying to train a speech encoder whose output is similar to the output of Tacotron2's encoder, using teacher-student training. Once it is trained, I will have an encoder whose input is audio rather than text. I trained this speech encoder on the LibriSpeech dataset for 200k steps with adversarial, MSE, CTC, and CE losses. But after replacing Tacotron2's encoder with this speech encoder in the pre-trained model, my output looks like this and I get the warning "Reached max decoder steps". The attention alignment shows a line, but past the step where the utterance should actually end, the synthesized audio continues with noisy speech until the end. What is the reason for this problem? Why doesn't the decoder recognize the actual final step at inference?

If anyone can help, I would be very grateful.

[Attached image: Capture]

ndz2011 commented 7 months ago

Hi Raha, have you solved the problem? I solved the same problem by lowering gate_threshold from 0.5 to 0.1.
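For anyone else hitting this warning, here is a minimal sketch of why lowering the threshold can help. It assumes the decoder stops at the first frame whose sigmoid gate output exceeds `gate_threshold`, and otherwise gives up at `max_decoder_steps`. The helper `decode_length` and the gate logits are made up for illustration and are not part of the repository. They model a gate that rises at the true end of the utterance but never gets past 0.5, which may be what happens when the encoder outputs differ from what the pre-trained decoder expects.

```python
import math


def sigmoid(x):
    """Logistic function mapping a gate logit to a stop probability."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_length(gate_logits, gate_threshold=0.5, max_decoder_steps=1000):
    """Hypothetical helper: return (frames_generated, hit_max_steps).

    Mimics a stop rule of the form: emit a frame, then stop if the
    gate probability exceeds the threshold, else stop at the step limit.
    """
    steps = 0
    for logit in gate_logits:
        steps += 1  # one mel frame is emitted per decoder step
        if sigmoid(logit) > gate_threshold:
            return steps, False  # gate fired: normal end of utterance
        if steps == max_decoder_steps:
            return steps, True  # the "Reached max decoder steps" case
    return steps, False  # ran out of simulated logits


# Made-up logits: the gate is near 0 for 49 steps, then rises only to
# about 0.27 (sigmoid(-1.0)) at step 50 and stays there.
gate_logits = [-5.0] * 49 + [-1.0] * 951

print(decode_length(gate_logits, gate_threshold=0.5))  # (1000, True)
print(decode_length(gate_logits, gate_threshold=0.1))  # (50, False)
```

With the 0.5 threshold the weak gate signal is never enough, so decoding runs to the limit and the extra frames come out as noise. With 0.1 the same signal stops decoding at step 50. A threshold that is too low can also cut speech off early, so it is worth checking a few utterances by ear.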