I have a question about the inference architecture of Tacotron2.
I know the paper uses the decoder's mel output as the prenet input for the next decoding step.
But why not use the final mel output (the decoder output combined with the postnet output, which I think is closer to the ground-truth mel) as the prenet input instead?
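
To make the two options concrete, here is a toy sketch of the inference loop as I understand it. Everything in it is a hypothetical stand-in: `prenet`, `decoder_step`, `postnet`, the `feed_back_postnet` flag, and the all-zero start frame are my own assumptions for illustration, not the actual Tacotron2 layers or any library's API.

```python
# Toy stand-ins for the real layers (hypothetical, for illustration only).

def prenet(frame):
    # Stand-in for the prenet: a simple element-wise transform.
    return [0.5 * x for x in frame]

def decoder_step(prenet_out, step):
    # Stand-in for one autoregressive decoder step; returns one mel frame.
    return [x + 0.1 * (step + 1) for x in prenet_out]

def postnet(frames):
    # Stand-in for the postnet; returns one residual per frame of the sequence.
    return [[0.01 * x for x in frame] for frame in frames]

def add(a, b):
    # Element-wise sum of two frames.
    return [x + y for x, y in zip(a, b)]

def infer(n_steps, n_mels=3, feed_back_postnet=False):
    prev = [0.0] * n_mels  # assumed all-zero start frame
    decoder_frames = []
    for step in range(n_steps):
        frame = decoder_step(prenet(prev), step)
        decoder_frames.append(frame)
        if feed_back_postnet:
            # Option B (my question): feed back decoder output + postnet residual.
            residual = postnet(decoder_frames)
            prev = add(frame, residual[-1])
        else:
            # Option A (the paper, as I understand it): feed back the raw decoder output.
            prev = frame
    # Final mel = decoder output + postnet residual over the whole sequence.
    residuals = postnet(decoder_frames)
    return [add(f, r) for f, r in zip(decoder_frames, residuals)]

mel_a = infer(4)                          # option A
mel_b = infer(4, feed_back_postnet=True)  # option B
```

In this sketch the two options only differ in what is assigned to `prev` inside the loop. Note that option B has to call `postnet` at every step on the frames produced so far, whereas option A calls it once at the end.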