Hi guys,

I'm trying to train a speech encoder whose output is similar to Tacotron2's text encoder output, using teacher-student training. Once it is trained, I will have an encoder that takes audio as input instead of text.

I trained this speech encoder with adversarial, MSE, CTC, and CE losses for 200k steps on the LibriSpeech dataset.
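To make the setup concrete, here is a minimal sketch of one training step. `teacher` (the frozen Tacotron2 text encoder), `student` (my speech encoder with CTC and classification heads), `disc` (the adversarial critic), and the loss weights are all placeholders, not my exact code:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, disc, batch):
    audio, text_ids, chars, input_lens, target_lens, frame_labels = batch

    with torch.no_grad():
        teacher_out = teacher(text_ids)            # (B, T, D), frozen target

    student_out, ctc_logits, ce_logits = student(audio)

    # MSE: match the teacher's encoder output frame-for-frame
    # (assumes student_out is already length-aligned with teacher_out)
    loss_mse = F.mse_loss(student_out, teacher_out)

    # CTC: predict the character sequence from the speech features
    log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)   # (T, B, C)
    loss_ctc = F.ctc_loss(log_probs, chars, input_lens, target_lens)

    # CE: auxiliary per-frame classification (e.g. phoneme labels)
    loss_ce = F.cross_entropy(ce_logits.flatten(0, 1), frame_labels.flatten())

    # Adversarial: push the critic to score student outputs as "real"
    real = torch.ones(student_out.size(0), 1, device=student_out.device)
    loss_adv = F.binary_cross_entropy_with_logits(disc(student_out), real)

    return loss_mse + loss_ctc + loss_ce + 0.1 * loss_adv
```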
But after swapping this speech encoder in for Tacotron2's text encoder in the pre-trained model, my output looks like this and I get the warning `Warning! Reached max decoder steps`. I do get a line in the attention alignment plot, but after the step where decoding should actually stop, there is noisy speech all the way to the end of the synthesized audio.
What could be causing this problem? Why doesn't the decoder recognize the actual stopping step at inference time?
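For context on the warning itself: as far as I understand, inference only stops early when the gate (stop-token) prediction crosses a threshold, so hitting max decoder steps means the gate never fires on my encoder's outputs. Roughly (paraphrased from `Decoder.inference` in NVIDIA's Tacotron2 `model.py`; other forks may differ):

```python
while True:
    decoder_input = self.prenet(decoder_input)
    mel_output, gate_output, alignment = self.decode(decoder_input)
    mel_outputs += [mel_output.squeeze(1)]

    if torch.sigmoid(gate_output.data) > self.gate_threshold:
        break                       # gate fired: normal end of speech
    elif len(mel_outputs) == self.max_decoder_steps:
        print("Warning! Reached max decoder steps")
        break                       # gate never fired: the warning I see

    decoder_input = mel_output      # autoregressive feedback
```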
I would be very thankful for any help.