First of all, thanks for this wonderful repo! I am just curious whether it is possible to convert the mel input back to wav again. I am trying out a model that uses the same concept as yours as a transformer decoder input, but I am not sure whether the predicted output (also in mel form) can be converted back to a waveform. Thank you very much in advance!
Hi there,
First, the output for each patch of the Transformer encoder is a 768-dimensional vector, which does not match the shape of a 16x16 spectrogram patch (256 values), so you need at least a linear layer to map 768 to 256, and you need to think about which loss to put on that layer's output.
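As a rough sketch of what I mean (illustrative code, not from this repo; plain MSE is just one possible choice of loss):

```python
import torch.nn as nn

# Project each 768-d encoder token back to a flattened 16x16 patch and
# penalize the reconstruction with MSE (one possible choice of loss).
proj = nn.Linear(768, 256)  # 768-d token -> 256 values (= 16 * 16)

def reconstruction_loss(encoder_tokens, target_patches):
    # encoder_tokens: (batch, n_patches, 768) Transformer outputs
    # target_patches: (batch, n_patches, 256) flattened input patches
    pred = proj(encoder_tokens)
    return nn.functional.mse_loss(pred, target_patches)
```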
Second, I don't know much about mel-to-waveform conversion, but a quick search found this: https://jumpml.com/howto-invert-logmel/output/, so it is at least not impossible. Note, however, that the spectrogram patches we use are 16x16 and overlapping.
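For reference, here is a rough sketch of such an inversion with librosa (the sample rate and STFT/mel parameters below are illustrative and must match whatever produced your mel spectrogram; if your mels are log-scaled, undo the compression first, e.g. with librosa.db_to_power):

```python
import numpy as np
import librosa

# Illustrative parameters; they must match the mel front end being inverted.
sr, n_fft, hop, n_mels = 16000, 400, 160, 128

# Dummy input: a 1-second 440 Hz tone, turned into a mel spectrogram.
t = np.linspace(0, 1, sr, endpoint=False)
wav_in = np.sin(2 * np.pi * 440 * t).astype(np.float32)
mel = librosa.feature.melspectrogram(
    y=wav_in, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)

# Invert the (phase-less) mel spectrogram back to a waveform.
wav_out = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=n_fft, hop_length=hop)
```

Keep in mind that Griffin-Lim only estimates the missing phase, so the reconstruction will sound audibly degraded compared to the original.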
Third, in our recent work, we actually implemented a method to force the Transformer model to recover the input spectrogram here. This is for self-supervised learning (SSL); the target we recover is the 768-dimensional patch embedding, but we have also tested recovering the 256-dimensional spectrogram patch. For details, please read our paper: https://arxiv.org/abs/2110.09784.
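A loose sketch of that idea (not the actual code from the paper; the encoder and head here are assumed stand-ins with the stated shapes, and the mask ratio is arbitrary):

```python
import torch
import torch.nn as nn

def masked_reconstruction_loss(encoder, head, patches, mask_ratio=0.4):
    # patches: (batch, n_patches, 256) flattened 16x16 spectrogram patches
    # encoder: (batch, n_patches, 256) -> (batch, n_patches, 768)
    # head:    nn.Linear(768, 256) projecting back to patch space
    b, n, _ = patches.shape
    mask = torch.rand(b, n, device=patches.device) < mask_ratio
    masked_in = patches.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked patches
    pred = head(encoder(masked_in))
    # Score the reconstruction only at the masked positions.
    return nn.functional.mse_loss(pred[mask], patches[mask])
```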
Hope these help.
-Yuan