I was looking into the ViT paper for some baseline results, and found their section on attention maps. This is very helpful for model explainability, as it shows which regions of a given image lead to the classification result. I am currently working on an audio-based task, so this model's fine-tuning on AudioSet is very useful to me. I was wondering if there is an equivalent implementation in the AST model to show which parts of the filter banks lead to a given classification. If it is not currently in place, how would you recommend I go about implementing something like that?
I don't have visualization code to release now, but I am quite sure it can be done in the same way as with ViT. If you are familiar with Transformers, it should not be hard to implement.
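For anyone looking for a starting point: since AST reuses the ViT architecture, the standard attention rollout technique (Abnar & Zuidema, 2020) used for ViT attention maps should carry over. Below is a minimal NumPy sketch, not the repo's own code: it assumes you have already collected the per-layer attention matrices (e.g. via PyTorch forward hooks on the attention modules), and it assumes the checkpoint prepends two special tokens ([CLS] plus a DeiT-style distillation token, as in the AST paper); adjust `num_special_tokens` and the patch `grid_shape` to your model config.

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout: multiply head-averaged attention matrices across
    layers, mixing in the identity to account for residual connections.

    attentions: list of arrays, one per layer (earliest first),
                each of shape (num_heads, num_tokens, num_tokens).
    Returns a (num_tokens, num_tokens) row-stochastic matrix.
    """
    result = None
    for att in attentions:
        att = att.mean(axis=0)                       # average over heads
        att = att + np.eye(att.shape[-1])            # residual connection
        att = att / att.sum(axis=-1, keepdims=True)  # re-normalize rows
        result = att if result is None else att @ result
    return result

def cls_attention_map(rollout, grid_shape, num_special_tokens=2):
    """Take the [CLS] row of the rollout, drop the special tokens, and
    reshape the patch attentions to the (freq_patches, time_patches) grid
    of the input spectrogram. grid_shape and num_special_tokens depend on
    the model config (assumption: 2 special tokens, as in AST/DeiT)."""
    cls_row = rollout[0, num_special_tokens:]
    return cls_row.reshape(grid_shape)
```

Overlaying the resulting map (upsampled to the fbank resolution) on the input spectrogram then shows which time-frequency regions drove the prediction, analogous to the ViT attention-map figures.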
Hello Yuan,
Thank you, Karim