janki3l opened this issue 2 years ago
I'm trying to understand how deep learning in speech synthesis works. The plots below were generated with Tacotron 2 after synthesizing speech with my own pretrained model. The left one is a typical mel spectrogram, the middle one is the same but run through the postnet, and the third is just the alignment graph. Could anyone explain how to interpret these? I know that a mel spectrogram is a visual representation of sound, and the one on the left roughly shows what the synthesizer is saying. The diagonal line on the right implies the algorithm had no major issues synthesizing the text. What about the middle one? Can we interpret it like this: the algorithm may have had some issues extrapolating the sound, which seems plausible because the voice is comprehensible but metallic?

The left one is the real mel spectrogram generated from the sound, the middle one is the mel spectrogram predicted by the model, and the last one is the alignment of the sentence to the wav. I can tell you about it in a lot of detail if you want me to. Hit me up on Matrix, my ID is @p0p4k:matrix.org.
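For intuition, here is a minimal numpy-only sketch of how a log-mel spectrogram (like the panels above) is computed from a raw waveform: frame the signal, take the power spectrum of each frame, and pool the frequency bins through triangular mel-scale filters. The frame size (1024), hop (256), and 80 mel bands are common Tacotron 2-style defaults, not values taken from this thread; a real pipeline would typically use `librosa.feature.melspectrogram` instead.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel (perceptual) scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(y, sr, n_fft=1024, hop=256, n_mels=80):
    window = np.hanning(n_fft)
    # Short-time frames, windowed to reduce spectral leakage.
    frames = [y[s:s + n_fft] * window
              for s in range(0, len(y) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # |STFT|^2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)  # log compression, as in the plots

sr = 22050
t = np.arange(sr) / sr               # one second of audio
y = np.sin(2 * np.pi * 440.0 * t)    # 440 Hz test tone
S = log_mel_spectrogram(y, sr)       # shape: (frames, 80)
```

A pure tone shows up as a single bright horizontal band in `S`; speech produces the harmonic/formant patterns seen in the spectrograms above, which is why the "real" and "predicted" panels can be compared visually.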