First of all, thanks for this wonderful repo! I am just curious whether it is possible to convert the mel input back to wav again. I am trying out a model that uses the same concept as yours as a transformer decoder input, but I am not sure whether the predicted output (also in mel form) can be converted back to a waveform. Thank you very much in advance!
Hi there,
First, the output for each patch of the Transformer encoder is a 768-dimensional vector, which does not match the shape of a 16x16 spectrogram patch (256 values), so you need at least a linear layer to map 768 to 256, and you need to think about which loss to put on that layer's output.
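As a rough sketch of what I mean (illustrative code, not from this repo; plain MSE is just one possible choice of loss):

```python
import torch.nn as nn

# Project each 768-d encoder token back to a flattened 16x16 patch and
# penalize the reconstruction with MSE (one possible choice of loss).
proj = nn.Linear(768, 256)  # 768-d token -> 256 values (= 16 * 16)

def reconstruction_loss(encoder_tokens, target_patches):
    # encoder_tokens: (batch, n_patches, 768) Transformer outputs
    # target_patches: (batch, n_patches, 256) flattened input patches
    pred = proj(encoder_tokens)
    return nn.functional.mse_loss(pred, target_patches)
```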
Second, I don't know much about mel-to-waveform conversion, but a quick search found this: https://jumpml.com/howto-invert-logmel/output/, so it is at least not impossible. Note, however, that the spectrogram patches we use are 16x16 and overlapping.
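For reference, here is a rough sketch of such an inversion with librosa (the sample rate and STFT/mel parameters below are illustrative and must match whatever produced your mel spectrogram; if your mels are log-scaled, undo the compression first, e.g. with librosa.db_to_power):

```python
import numpy as np
import librosa

# Illustrative parameters; they must match the mel front end being inverted.
sr, n_fft, hop, n_mels = 16000, 400, 160, 128

# Dummy input: a 1-second 440 Hz tone, turned into a mel spectrogram.
t = np.linspace(0, 1, sr, endpoint=False)
wav_in = np.sin(2 * np.pi * 440 * t).astype(np.float32)
mel = librosa.feature.melspectrogram(
    y=wav_in, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)

# Invert the (phase-less) mel spectrogram back to a waveform.
wav_out = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=n_fft, hop_length=hop)
```

Keep in mind that Griffin-Lim only estimates the missing phase, so the reconstruction will sound audibly degraded compared to the original.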
Third, in our recent work, we actually implemented a method to force the Transformer model to recover the input spectrogram here. This is for self-supervised learning (SSL); the target we recover is the 768-dimensional patch embedding, but we have also tested recovering the 256-dimensional spectrogram patch. For details, please read our paper: https://arxiv.org/abs/2110.09784.
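A loose sketch of that idea (not the actual code from the paper; the encoder and head here are assumed stand-ins with the stated shapes, and the mask ratio is arbitrary):

```python
import torch
import torch.nn as nn

def masked_reconstruction_loss(encoder, head, patches, mask_ratio=0.4):
    # patches: (batch, n_patches, 256) flattened 16x16 spectrogram patches
    # encoder: (batch, n_patches, 256) -> (batch, n_patches, 768)
    # head:    nn.Linear(768, 256) projecting back to patch space
    b, n, _ = patches.shape
    mask = torch.rand(b, n, device=patches.device) < mask_ratio
    masked_in = patches.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked patches
    pred = head(encoder(masked_in))
    # Score the reconstruction only at the masked positions.
    return nn.functional.mse_loss(pred[mask], patches[mask])
```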
Hope these help.
-Yuan