janki3l opened this issue 2 years ago
I'm trying to understand how deep learning in speech synthesis works. The plots below were generated with Tacotron 2 after synthesizing speech with my own pretrained model. The left one is a typical mel spectrogram, the middle one is the same but run through the postnet, and the third is just the alignment graph. Could anyone explain how to interpret these? I know that a mel spectrogram is a visual representation of sound, and the one on the left roughly shows what the synthesizer is saying. The diagonal line on the right implies the algorithm had no major issues synthesizing the text. What about the middle one? Can we interpret it like this: the algorithm may have had some issues extrapolating the sound, which seems plausible because the voice is comprehensible but metallic?

The left one is the real mel spectrogram generated from the sound, the middle one is the mel spectrogram predicted by the model, and the last one is the alignment of the sentence to the wav. I can tell you about it in a lot of detail if you want me to. Hit me up on Matrix, my ID is @p0p4k:matrix.org.
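For intuition, here is a minimal numpy-only sketch of how a log-mel spectrogram (like the panels above) is computed from a raw waveform: frame the signal, take the power spectrum of each frame, and pool the frequency bins through triangular mel-scale filters. The frame size (1024), hop (256), and 80 mel bands are common Tacotron 2-style defaults, not values taken from this thread; a real pipeline would typically use `librosa.feature.melspectrogram` instead.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel (perceptual) scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(y, sr, n_fft=1024, hop=256, n_mels=80):
    window = np.hanning(n_fft)
    # Short-time frames, windowed to reduce spectral leakage.
    frames = [y[s:s + n_fft] * window
              for s in range(0, len(y) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # |STFT|^2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)  # log compression, as in the plots

sr = 22050
t = np.arange(sr) / sr               # one second of audio
y = np.sin(2 * np.pi * 440.0 * t)    # 440 Hz test tone
S = log_mel_spectrogram(y, sr)       # shape: (frames, 80)
```

A pure tone shows up as a single bright horizontal band in `S`; speech produces the harmonic/formant patterns seen in the spectrograms above, which is why the "real" and "predicted" panels can be compared visually.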