Hi,
Thank you for your work. I noticed a mismatch between your implementation and the paper regarding the postnet. In the paper, the postnet is conditioned on the decoder input and the mel-decoder output, whereas in the implementation you condition the postnet on the decoder input and the other conditions (speaker, emotion, and prosody). You don't condition it on the output of the transformer decoder. Is there a reason for this mismatch?
Kind regards