YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Using mel spectrogram with different number of bins #76

Closed Yuhan-Shen closed 2 years ago

Yuhan-Shen commented 2 years ago

Hi,

Thanks for sharing your great work! In ast_models.py, we can change input_tdim to make the model applicable to audio whose duration differs from that used in pretraining. I am curious whether it is also possible to make it work with input data that has a different number of mel bins (e.g., 64 rather than the 128 used in the paper). Maybe we could also interpolate along the frequency dimension? I would like to know whether this is doable and reasonable.

YuanGongND commented 2 years ago

Hi Yuhan,

Thanks for the question.

> I am curious if it is possible to make it also suitable to input data with different number of mel bins (e.g. 64 rather than 128 in the paper).

With a fixed sampling rate (16 kHz in our setting), using a larger number of mel bins can lead to a noticeable performance improvement for audio tagging (see PANNs paper Table X), but at a higher computational cost.
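To make the cost side concrete, here is a rough sketch of how the patch count scales with the number of mel bins. It assumes the setup described in the AST paper (16x16 patches extracted with a stride of 10 in both frequency and time, with no padding); the function name `patch_grid` is made up for illustration and is not part of this repo.

```python
def patch_grid(n_mels, n_frames, patch=16, fstride=10, tstride=10):
    """Return (f_dim, t_dim), the number of patches along frequency and time.

    Assumption: patches come from an unpadded strided convolution,
    so each axis yields floor((size - patch) / stride) + 1 patches.
    """
    f_dim = (n_mels - patch) // fstride + 1
    t_dim = (n_frames - patch) // tstride + 1
    return f_dim, t_dim


# Compare 64 and 128 mel bins for a 1024-frame input.
for n_mels in (64, 128):
    f_dim, t_dim = patch_grid(n_mels, n_frames=1024)
    print(n_mels, "mel bins:", f_dim, "x", t_dim, "=", f_dim * t_dim, "patches")
```

Under these assumptions, 128 bins give a 12 x 101 grid (1212 patches) while 64 bins give a 5 x 101 grid (505 patches), so the Transformer sequence length, and with it the attention cost, grows with the number of bins.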

I think it is reasonable to use a different number of bins and do something similar to what we did for time interpolation. It won't be optimal, but my guess is that it is still better than no pretraining. However, it is also more complex. For example, if your pretraining and fine-tuning sampling frequencies are the same and you just want to use a different number of bins in fine-tuning, you should use interpolation, probably not on a linear scale. If your pretraining sampling frequency is higher than your fine-tuning sampling frequency and you want to adjust the bins, then you should use cutting, and so on.
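The two cases above could be sketched roughly as follows. This is not code from this repo: the function names are hypothetical, NumPy is used only for illustration, and the sketch assumes the patch positional embeddings have already been reshaped to a (f_dim, t_dim, embed_dim) grid with any class tokens set aside and with row 0 as the lowest frequency band. It uses plain linear interpolation as the simplest option; as noted, a non-linear frequency mapping is probably a better fit.

```python
import numpy as np


def resize_freq_pos_embed(pos_embed, new_f_dim):
    """Linearly interpolate a (f_dim, t_dim, dim) grid along the frequency axis.

    Hypothetical helper: the first and last frequency rows are kept aligned.
    """
    old_f_dim = pos_embed.shape[0]
    # Positions of the new rows, expressed in old-row coordinates.
    new_pos = np.linspace(0.0, old_f_dim - 1, new_f_dim)
    lo = np.floor(new_pos).astype(int)
    hi = np.minimum(lo + 1, old_f_dim - 1)
    # Interpolation weight of the upper neighbour, broadcast over (t_dim, dim).
    w = (new_pos - lo)[:, None, None]
    return (1.0 - w) * pos_embed[lo] + w * pos_embed[hi]


def cut_freq_pos_embed(pos_embed, new_f_dim):
    """Keep only the first new_f_dim frequency rows (the "cutting" case).

    Assumption: row 0 is the lowest frequency band.
    """
    return pos_embed[:new_f_dim]


# Random values stand in for a pretrained 12 x 101 grid of 768-dim embeddings.
pretrained = np.random.rand(12, 101, 768)
print(resize_freq_pos_embed(pretrained, 5).shape)
print(cut_freq_pos_embed(pretrained, 5).shape)
```

Both calls return a (5, 101, 768) grid. How many rows to keep when cutting depends on how the two mel filterbanks overlap, which this sketch does not try to compute.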

Given this complexity, I think the best approach might be to pretrain the model again with the desired number of bins and sampling frequency, keeping them consistent between pretraining and fine-tuning. The code in this repo only supports time adjustments right now.

-Yuan

Yuhan-Shen commented 2 years ago

Thanks for your detailed answer. Very informative!