NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Comparing mel spectrogram with postnet mel #570

Open janki3l opened 2 years ago

janki3l commented 2 years ago

I'm trying to understand how deep learning in speech synthesis works. The plots below were generated with Tacotron 2 after synthesizing speech with my own pretrained model. The left one is the regular mel spectrogram, the middle one is the same mel but run through the postnet, and the third one is just the alignment graph. Could anyone explain how to interpret these? I know that a mel spectrogram is a graphical representation of sound, and the one on the left roughly visualizes what the synthesizer is saying. The clean line on the right implies the algorithm had no major issues synthesizing the text. What about the middle one? Can we interpret it as meaning the algorithm had some issues extrapolating the sound? That would match my result, because the voice is comprehensible but metallic.

[attached image: mel spectrogram, postnet mel spectrogram, and alignment plots]
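For context on what the middle plot is: in Tacotron 2 the postnet is a small convolutional stack that predicts a *residual* which is added to the decoder's mel output to sharpen it. The sketch below illustrates that residual idea with a toy two-layer postnet; the layer sizes and names here are illustrative stand-ins, not the repo's exact configuration.

```python
import torch
import torch.nn as nn

# Toy sketch of the residual-postnet idea: the decoder emits a coarse mel
# spectrogram, and a convolutional stack predicts a correction that is ADDED
# to it. Sizes below are illustrative, not the repo's exact hyperparameters.
n_mels = 80
postnet = nn.Sequential(
    nn.Conv1d(n_mels, 512, kernel_size=5, padding=2),
    nn.Tanh(),
    nn.Conv1d(512, n_mels, kernel_size=5, padding=2),
)

mel_outputs = torch.randn(1, n_mels, 200)  # (batch, mel bins, frames) from the decoder
mel_outputs_postnet = mel_outputs + postnet(mel_outputs)  # residual refinement
print(mel_outputs_postnet.shape)  # same shape as the decoder mel
```

The left and middle plots in the question correspond to `mel_outputs` and `mel_outputs_postnet`: same content, but the postnet version should look crisper because of the learned residual.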

p0p4k commented 2 years ago

The left one is the real mel spectrogram generated from the sound. The middle one is the mel spectrogram predicted by the algorithm, and the last one is the alignment of the sentence to the wav. I can tell you about it in a lot of detail if you want me to. Hit me up on Matrix, my ID is @p0p4k:matrix.org.
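On the alignment plot: the clean diagonal line means each decoder frame attends to a steadily advancing position in the input text, i.e. the attention is monotonic. A rough sketch of that check on a synthetic diagonal alignment (all data here is made up for illustration):

```python
import numpy as np

# Build a synthetic one-hot diagonal alignment: decoder frame t attends to
# text position t * n_chars / n_frames, which advances monotonically.
n_frames, n_chars = 200, 50
alignment = np.zeros((n_frames, n_chars))
for t in range(n_frames):
    alignment[t, int(t * n_chars / n_frames)] = 1.0

# A healthy alignment's attended text position never moves backwards.
attended = alignment.argmax(axis=1)
is_monotonic = bool(np.all(np.diff(attended) >= 0))
print(is_monotonic)
```

Blurry, jumpy, or repeated segments in that plot usually show up as stuttering or skipped words in the audio; a metallic timbre with a clean alignment points instead at the mel prediction (or vocoder) rather than the attention.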