I was looking into the ViT paper for some baseline results, and found their section on attention maps. This is very helpful for model explainability, as it shows which regions of a given image lead to the classification result. I am currently working on an audio-based task, so this model's fine-tuning on AudioSet is very useful to me. I was wondering if there is an equivalent implementation in the AST model to show which parts of the filter banks lead to a given classification. If it is not currently in place, how would you recommend I go about implementing something like that?
I don't have visualization code to release now, but I am quite sure it can be done in the same way as with ViT. If you are familiar with Transformers, it should not be hard to implement.
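For anyone looking for a starting point: since AST reuses the ViT architecture, the standard attention rollout technique (Abnar & Zuidema, 2020) used for ViT attention maps should carry over. Below is a minimal NumPy sketch, not the repo's own code: it assumes you have already collected the per-layer attention matrices (e.g. via PyTorch forward hooks on the attention modules), and it assumes the checkpoint prepends two special tokens ([CLS] plus a DeiT-style distillation token, as in the AST paper); adjust `num_special_tokens` and the patch `grid_shape` to your model config.

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout: multiply head-averaged attention matrices across
    layers, mixing in the identity to account for residual connections.

    attentions: list of arrays, one per layer (earliest first),
                each of shape (num_heads, num_tokens, num_tokens).
    Returns a (num_tokens, num_tokens) row-stochastic matrix.
    """
    result = None
    for att in attentions:
        att = att.mean(axis=0)                       # average over heads
        att = att + np.eye(att.shape[-1])            # residual connection
        att = att / att.sum(axis=-1, keepdims=True)  # re-normalize rows
        result = att if result is None else att @ result
    return result

def cls_attention_map(rollout, grid_shape, num_special_tokens=2):
    """Take the [CLS] row of the rollout, drop the special tokens, and
    reshape the patch attentions to the (freq_patches, time_patches) grid
    of the input spectrogram. grid_shape and num_special_tokens depend on
    the model config (assumption: 2 special tokens, as in AST/DeiT)."""
    cls_row = rollout[0, num_special_tokens:]
    return cls_row.reshape(grid_shape)
```

Overlaying the resulting map (upsampled to the fbank resolution) on the input spectrogram then shows which time-frequency regions drove the prediction, analogous to the ViT attention-map figures.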
Hello Yuan,
Thank you, Karim