YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Convert mel filterbanks to wav again? #50

Closed: clairerity closed this issue 2 years ago

clairerity commented 2 years ago

First of all, thanks for this wonderful repo! I am curious whether it is possible to convert the mel input back to wav. I am trying out a model that uses the same concept as yours as the input to a Transformer decoder, but I am not sure whether the predicted output (also in mel form) can be converted back to wav. Thank you very much in advance!

YuanGongND commented 2 years ago

Hi there,

First, the Transformer encoder outputs a 768-dimensional vector for each patch, which does not match the shape of a 16x16 spectrogram patch (256 values once flattened). You therefore need at least a linear layer to map 768 to 256 dimensions, and you need to decide which loss to apply to the output of that layer.
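
A minimal PyTorch sketch of the mapping described above. The class name `PatchReconstructionHead`, the random stand-in tensors, and the choice of mean squared error as the loss are all assumptions for illustration, not code from this repo. If the encoder output also contains class or distillation tokens, those would need to be dropped before the head is applied.

```python
import torch
import torch.nn as nn

EMBED_DIM = 768        # per-patch output size of the Transformer encoder
PATCH_DIM = 16 * 16    # a 16x16 spectrogram patch, flattened


class PatchReconstructionHead(nn.Module):
    """Hypothetical head: maps each 768-d patch output to a 256-d patch."""

    def __init__(self, embed_dim=EMBED_DIM, patch_dim=PATCH_DIM):
        super().__init__()
        self.proj = nn.Linear(embed_dim, patch_dim)

    def forward(self, encoder_out):
        # encoder_out: (batch, num_patches, embed_dim)
        return self.proj(encoder_out)


torch.manual_seed(0)
batch, num_patches = 2, 12

# Random stand-ins for the encoder outputs and the ground-truth patches.
encoder_out = torch.randn(batch, num_patches, EMBED_DIM)
target_patches = torch.randn(batch, num_patches, PATCH_DIM)

head = PatchReconstructionHead()
pred_patches = head(encoder_out)                    # (batch, num_patches, 256)

# Assumed loss: plain MSE between predicted and true patches.
loss = nn.functional.mse_loss(pred_patches, target_patches)
loss.backward()
```

Each predicted row can be reshaped to 16x16 with `pred_patches.reshape(batch, num_patches, 16, 16)`.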

Second, I don't know much about mel-to-waveform conversion, but a quick search found this: https://jumpml.com/howto-invert-logmel/output/, so at least it is not impossible. Note, however, that the spectrogram patches we use are 16x16 and overlap, so predicted patches have to be merged wherever they overlap before the spectrogram can be inverted.
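
One way to handle the overlap is to average the predictions wherever patches cover the same spectrogram bin. The NumPy sketch below illustrates that idea only; the helper names `extract_patches` and `merge_patches`, the stride of 10, the row-major patch order, and the 128x100 input size are assumptions and should be checked against the configuration actually used.

```python
import numpy as np


def extract_patches(spec, patch=16, stride=10):
    """Split a (freq, time) array into overlapping patches, row-major order."""
    n_f = (spec.shape[0] - patch) // stride + 1
    n_t = (spec.shape[1] - patch) // stride + 1
    patches = [
        spec[i * stride:i * stride + patch, j * stride:j * stride + patch]
        for i in range(n_f)
        for j in range(n_t)
    ]
    return np.stack(patches), (n_f, n_t)


def merge_patches(patches, grid, patch=16, stride=10):
    """Rebuild a (freq, time) array by averaging overlapping patch values."""
    n_f, n_t = grid
    height = (n_f - 1) * stride + patch
    width = (n_t - 1) * stride + patch
    total = np.zeros((height, width))
    count = np.zeros((height, width))
    for idx, block in enumerate(patches):
        i, j = divmod(idx, n_t)
        f0, t0 = i * stride, j * stride
        total[f0:f0 + patch, t0:t0 + patch] += block
        count[f0:f0 + patch, t0:t0 + patch] += 1
    return total / count


rng = np.random.default_rng(0)
spec = rng.standard_normal((128, 100))    # stand-in (mel bins, frames)

patches, grid = extract_patches(spec)
merged = merge_patches(patches, grid)

# Bins not covered by any full patch (the trailing edge) are lost.
covered = spec[:merged.shape[0], :merged.shape[1]]
round_trip_ok = np.allclose(merged, covered)
```

With ground-truth patches the round trip is exact on the covered region. With predicted patches the overlapping values will disagree, and averaging is just one simple way to reconcile them.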

Third, in our recent work, we implemented a method that forces the Transformer model to recover the input spectrogram. This is for self-supervised learning (SSL), and the target to recover is the 768-dimensional patch embedding, but we have also tested recovering the 256-dimensional spectrogram patch. For details, please read our paper: https://arxiv.org/abs/2110.09784.
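
As a rough illustration of recovering masked spectrogram patches, the sketch below hides a few patches, encodes the sequence, and applies MSE only at the hidden positions. It is a simplified stand-in, not the objective from the paper: the one-layer `nn.TransformerEncoder`, the linear patch embedding, the zero-initialised mask token, the number of masked patches, and the omission of positional embeddings are all assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
EMBED_DIM, PATCH_DIM = 768, 256
batch, num_patches, num_masked = 2, 20, 5

# Random stand-in for flattened 16x16 spectrogram patches.
patches = torch.randn(batch, num_patches, PATCH_DIM)

embed = nn.Linear(PATCH_DIM, EMBED_DIM)              # patch embedding
mask_token = nn.Parameter(torch.zeros(EMBED_DIM))    # learnable placeholder
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=12, batch_first=True),
    num_layers=1,
)
head = nn.Linear(EMBED_DIM, PATCH_DIM)               # 768 -> 256 reconstruction

# Pick num_masked patches to hide in every clip.
mask = torch.zeros(batch, num_patches, dtype=torch.bool)
for b in range(batch):
    mask[b, torch.randperm(num_patches)[:num_masked]] = True

tokens = embed(patches)
# Replace the embeddings of hidden patches with the mask token.
tokens = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)

pred = head(encoder(tokens))                         # (batch, num_patches, 256)

# Reconstruction loss on the hidden patches only.
loss = nn.functional.mse_loss(pred[mask], patches[mask])
loss.backward()
```

Recovering the 768-dimensional patch embedding instead would mean dropping `head` and comparing the encoder output with the embeddings of the unmasked input.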

Hope these help.

-Yuan