YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Attention maps for model explainability #59

Closed kremHabashy closed 2 years ago

kremHabashy commented 2 years ago

Hello Yuan,

I was looking into the ViT paper for some baseline implementations and found its section on attention maps. These are very helpful for model explainability, as they show which regions of an image lead to its classification result. I am currently working on an audio-based task, so this model's AudioSet fine-tuning is very useful to me. I was wondering whether there is an equivalent implementation in the AST model that shows which parts of the filter banks lead to a given classification. If there is not one currently in place, how would you recommend I go about implementing something like that?

Thank you, Karim

YuanGongND commented 2 years ago

Hi there,

I don't have visualization code to release at the moment, but I am quite sure it can be done in the same way as for ViT. If you are familiar with Transformers, it is not hard to implement.
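(For readers looking for a starting point: the attention maps in the ViT paper are produced with attention rollout (Abnar & Zuidema, "Quantifying Attention Flow in Transformers"), which fuses the per-layer attention matrices into a single token-relevance map. A minimal NumPy sketch of the rollout itself is below; how you collect the per-layer attention matrices from AST is up to you, e.g. with forward hooks on the attention modules. The shapes and names here are illustrative assumptions, not the AST API.)

```python
import numpy as np

def attention_rollout(attentions):
    """Fuse per-layer attention maps into one token-relevance matrix.

    attentions: list of arrays, one per transformer layer, each of shape
    (num_heads, num_tokens, num_tokens), with rows summing to 1.
    Returns a (num_tokens, num_tokens) matrix; row 0 gives the relevance
    of every token to the [CLS] token.
    """
    result = None
    for attn in attentions:
        a = attn.mean(axis=0)                  # average over heads
        a = a + np.eye(a.shape[-1])            # add identity for the residual path
        a = a / a.sum(axis=-1, keepdims=True)  # re-normalize rows
        result = a if result is None else a @ result
    return result

# Toy example: 2 layers, 1 head, 4 tokens (token 0 playing the [CLS] role).
rng = np.random.default_rng(0)
attns = [rng.random((1, 4, 4)) for _ in range(2)]
attns = [a / a.sum(axis=-1, keepdims=True) for a in attns]  # valid attention rows
rollout = attention_rollout(attns)
cls_relevance = rollout[0, 1:]  # per-patch relevance to the classification token
```

Since AST tokenizes the filter-bank spectrogram into 16x16 patches, `cls_relevance` can then be reshaped to the patch grid and upsampled to overlay on the fbank input.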

-Yuan