YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Question regarding fbank for fine tuning #52

Closed kremHabashy closed 2 years ago

kremHabashy commented 2 years ago

Hi Yuan,

Thank you for this great work! I am currently fine-tuning the models you produced for a project I am working on, and I really appreciate the opportunity. I had a question regarding the spectrograms (or fbanks) produced by the wav2vec function.

Currently, I am trying to prepare a dataset to match the requirements of the model, but I have stumbled upon something that grabbed my attention: you mention in the paper that the model accepts variable-length inputs. Taking a closer look, I found that this is due to the padding added below the fbank, which is done to fix the input dimensions into the model. However, when I applied this to my own data and converted the fbanks to images, the padding was a different color in each image. Here are two examples: d4-2 wav, d10-2 wav. Although I am aware that the values in the solid-colored areas are zeros, I worry that this indicates the same color being attributed to different values in different spectrograms, and I wonder how that would impact the model's understanding of color.

My second question is regarding the use of padding specifically. In the ViT paper as well as AST, images are fed through as a collection of patches for learning. Patches that are fully blank naturally would not add much information to the model. However, for the patches that have an overlap of fbank and spectrogram, is there no effect on learning there? Also, if a specific category is relatively shorter in length than another, does the model include that audio file length in its representation of that class?

Any insight on the above would be deeply appreciated. Thanks again.

YuanGongND commented 2 years ago

Hi there,

For the first question: we require a fixed input length for each task, but the length doesn't have to be 10s (the setting for AudioSet). We use 5s for ESC-50 and 1s for Speech Commands, and both work well (please check our recipes). You just need to specify input_tdim as the number of frames (audio length in seconds * 100) when you instantiate the AST model. In general, you can use the mean/max audio length of your task for input_tdim. For your example, the appropriate number is probably 100 or 200. When you set input_tdim=200, the script cuts/pads your audios to 200 frames instead of 1000 frames. Further, even if the model is pretrained on AudioSet with input_tdim=1000, it can be transferred to your task with input_tdim=200; the trick is positional embedding adaptation, which our code handles automatically (just set imagenet_pretrain=True and audioset_pretrain=True). In our ESC-50 recipe, we find AudioSet pretraining leads to a ~6% improvement even when input_tdim is not consistent between pretraining and fine-tuning.
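The cut/pad step described above can be sketched as follows. This is a minimal numpy illustration, not the repo's actual dataloader (which works on torch tensors); the function name `pad_or_cut` is mine:

```python
import numpy as np

def pad_or_cut(fbank, target_tdim):
    """Zero-pad below (or truncate) an (n_frames, n_mels) fbank to target_tdim frames."""
    n_frames, n_mels = fbank.shape
    if n_frames < target_tdim:
        pad = np.zeros((target_tdim - n_frames, n_mels), dtype=fbank.dtype)
        return np.concatenate([fbank, pad], axis=0)
    return fbank[:target_tdim]

# a 2 s clip at a 10 ms frame shift -> 200 frames of 128 mel bins
fbank = np.random.randn(200, 128).astype(np.float32)
print(pad_or_cut(fbank, 1024).shape)  # (1024, 128): padded below with zeros ("silence")
print(pad_or_cut(fbank, 100).shape)   # (100, 128): cut to length
```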

Setting an appropriate input_tdim not only makes learning simpler but also significantly improves computational efficiency (AST is O(n^2), where n is the length of the input).

ASTModel(label_dim=527,
         fstride=10, tstride=10,
         input_fdim=128, input_tdim=1024,
         imagenet_pretrain=True, audioset_pretrain=False,
         model_size='base384')
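To make the O(n^2) point above concrete, here is a rough sketch of how input_tdim controls the patch count n, assuming AST's default 16x16 patches and the strides shown above (the formula mirrors the f_dim/t_dim computation in the repo, but the function name is mine):

```python
def n_patches(input_fdim, input_tdim, fsize=16, tsize=16, fstride=10, tstride=10):
    """Number of 16x16 patches AST extracts from an input_fdim x input_tdim fbank."""
    f_dim = (input_fdim - fsize) // fstride + 1
    t_dim = (input_tdim - tsize) // tstride + 1
    return f_dim * t_dim

print(n_patches(128, 1024))  # 1212 patches for the AudioSet setting
print(n_patches(128, 200))   # 228 patches for input_tdim=200
```

Since self-attention is quadratic in the sequence length, dropping from 1212 to 228 patches cuts the attention cost by roughly (1212/228)^2, i.e. around 28x.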

For the second question, could you explain more what you mean by "for the patches that have an overlap of fbank and spectrogram, is there no effect on learning there?" I think AST also works quite well without overlap; overlapping improves the performance, but not dramatically.

"Also, if a specific category is relatively shorter in length than another, does the model include that audio file length in its representation of that class?"

Yes, and we should avoid the model making predictions based on such nuisance factors. But that is a problem of dataset preparation, not of AST; I think other DNN models would be impacted as well.

-Yuan

kremHabashy commented 2 years ago

Hi Yuan,

Thank you for the quick reply. No further issues with the first question; the insight on input_tdim is greatly appreciated. For the second question, my apologies: I did not mean a mix of fbank and spectrogram, but rather a mix of fbank and padded zeros (a patch containing both fbank information and blank space). When these patches are fed into the model during training, is there no impact on learning from that end?

The third question has also been addressed.

Thank you

kremHabashy commented 2 years ago

Also, could you comment on why the padded regions in the two images above are different colours despite using the same padding, and on the implications this may have on learning?

Thanks again

YuanGongND commented 2 years ago

I think using a mixture of padding and valid spectrogram is fine, as the padding can be viewed as silence. The different colors are due to matplotlib's colormap settings; if you look at the values in the fbank matrices themselves, they should be the same.
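The per-image scaling behind this can be demonstrated without matplotlib: by default, imshow rescales each image to its own min/max before applying the colormap, so the same value (the zero padding) lands at a different point of the colormap in each image. A small numpy sketch, with an illustrative function name:

```python
import numpy as np

def normalize_for_colormap(img):
    """Mimic imshow's default per-image rescaling to [0, 1] before colormapping."""
    return (img - img.min()) / (img.max() - img.min())

a = np.array([[-5.0, 0.0, 3.0]])  # fbank A: zero padding, values span [-5, 3]
b = np.array([[-1.0, 0.0, 9.0]])  # fbank B: same zero padding, values span [-1, 9]
print(normalize_for_colormap(a)[0, 1])  # 0.625 -> zero sits 62.5% up A's range
print(normalize_for_colormap(b)[0, 1])  # 0.1   -> but only 10% up B's range
```

Passing a fixed vmin/vmax to plt.imshow would make zero map to the same color in every image; either way, the model never sees these colors, only the raw fbank values.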

kremHabashy commented 2 years ago

Hi Yuan, thank you for your answers, this can be closed now.