fawazsammani / nlxgpt

NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, CVPR 2022 (Oral)

question about visual attention map #2

Closed Ellyuca closed 2 years ago

Ellyuca commented 2 years ago

Hi @fawazsammani.

First of all thank you once again for providing the tutorial for single image usage.

I was playing around with the model and I am curious about one thing. Together with the textual explanation, we can also get a visual explanation in the form of an attention map. I was wondering whether this visual explanation is computed over the entire output sentence (= answer + explanation). Is it possible to split the visual attention map into two different images: one that focuses on the classification part (answer) and another that focuses on the explanation part? In the first I would have highlighted the areas that are important for the answer prediction, and in the second only the parts that are most important for the explanation.

Sorry for my lack of knowledge regarding this problem. Thank you for your time. Best wishes.

fawazsammani commented 2 years ago

Hi, yes, of course you can get them. We only provided the visual explanation for the answer in the demo, but you can get a visual explanation for every word (answer + explanation).

The line of code here: last_am = xa_maps[-1].mean(1)[0, question_len:] in app.py means: take the attention map of the last layer, average the attention heads, and trim it so that it starts from the answer and runs to the end of the explanation. Then this line: mask = last_am[0, :] means take the attention map for the answer (index 0). If you want the visual explanation for all tokens, you can simply iterate over last_am. So last_am[1, :] gives you the cross-attention map for the word "because", last_am[2, :] gives you the cross-attention map for the first explanation word, and so on.
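If it helps, here is a rough sketch of that loop (the decoded_tokens list, the 14x14 patch grid and the 224x224 upsampling size are my assumptions here, not necessarily what app.py does):

```python
import torch.nn.functional as F

last_am = xa_maps[-1].mean(1)[0, question_len:]   # (num_generated_tokens, 196)
for i, token in enumerate(decoded_tokens):        # answer + "because" + explanation tokens
    mask = last_am[i, :].reshape(14, 14)          # 196 patches -> 14x14 grid
    mask = F.interpolate(mask[None, None], size=(224, 224),
                         mode="bilinear", align_corners=False)[0, 0]
    # overlay `mask` on the input image to see what the model attends to for `token`
```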

Ellyuca commented 2 years ago

Hi @fawazsammani

Ohhh, I see. I think I get it now. Thanks a lot for the help.

Have an amazing day!

Ellyuca commented 2 years ago

Hi @fawazsammani.

Sorry for bothering you again. I've been looking at the output for some images for both VQA_X and ACT_X.

In the case of VQA_X, before visualizing the attentions, you do:

last_am = xa_maps[-1].mean(1)[0, question_len:]

to trim it so that it starts from the answer and runs to the end of the explanation.

As an example, let's suppose we have an image (image id = 018680084.jpg) and we ask the question "Is it raining?", and the system responds with "no because the ground is covered in snow". This sentence has 8 words, so I expect last_am to have a shape of 8. In this case the shape of xa_maps is torch.Size([1, 12, 15, 196]) and last_am has shape torch.Size([8, 196]), which is what I expected for last_am. But how is the dimension 15 in xa_maps obtained? The question has 3 words and the response has 8 words, so I get 3 + 8 = 11. What am I missing here? [later edit: in this case question_len is 7 (question_len = len(tokenizer.convert_ids_to_tokens(input_ids[0]))), so last_am has a size of 8 because we skip the first 7 positions and go straight to the answer part; 15 - 7 = 8, which is the size of the output sentence.]
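For reference, this is roughly how I am checking the shapes (a minimal sketch of my debugging code, using the numbers from this first example):

```python
question_len = len(tokenizer.convert_ids_to_tokens(input_ids[0]))
last_am = xa_maps[-1].mean(1)[0, question_len:]

print(xa_maps[-1].shape)   # torch.Size([1, 12, 15, 196]) for "Is it raining?"
print(question_len)        # 7
print(last_am.shape)       # torch.Size([8, 196]), i.e. 15 - 7 rows
```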

But if I use the same image and change the question to "what is he doing here?", the response is "skiing because he is on a snow covered hill with skis on his feet", which has 14 words. However, in this case the shape of xa_maps is torch.Size([1, 12, 24, 196]) and last_am is torch.Size([15, 196]). [later edit: in this case question_len is 9, so we obtain 24 - 9 = 15, which is the size of last_am.] But there are only 14 words in the output sentence, so there is a mismatch between the words and the last_am attention tensor. I can't understand why this is happening.

Something similar happens for the ACT_X demo as well. Given an image (085226387.jpg), the response from the system is 'ballroom', 'because he is standing on a stage with a partner and dancing with a partner'. In this case we have 16 words, the shape of xa_maps is torch.Size([1, 12, 21, 196]), and the shape of last_am is torch.Size([21, 196]). So it seems that I have more attention maps than words.
[later edit: Please correct me if I am wrong, but shouldn't we also do something like question_len = len(tokenizer.convert_ids_to_tokens(input_ids[0])) and last_am = xa_maps[-1].mean(1)[0, question_len:] for the ACT_X demo? I think this would remove the part with "the answer is". But even if I do this, I still get a larger last_am than the number of words in the output sentence: xa_maps has shape torch.Size([1, 12, 21, 196]) and last_am has shape torch.Size([17, 196]), but there are 16 words in the output sentence.]

For another image (026558760.jpg), the response is 'yard work', 'because he is standing in a yard and using a mop to cut through a tree trunk'. So xa_maps has shape torch.Size([1, 12, 24, 196]) and last_am has shape torch.Size([24, 196]), but the response has 19 words. In this case, is "yard work" treated as a single word/token? If so, then we have a sentence of 18 words and still a mismatch between the length of the response and the dimension of the last_am attention tensor. In this ACT_X case, will mask = last_am[0, :] still give me the attention map for the answer (index 0), last_am[1, :] the one for the word "because", and so on, if we don't remove the input_ids part?

I can't figure out what I'm missing here. I would greatly appreciate any insights or explanation.

I apologize once again for bothering you and for the long post. Thank you for your time and patience.

Ellyuca commented 2 years ago

After analyzing the code a bit more, I think I understand where the inconsistency between the attention map shape and the number of words comes from. Just my two cents, so please correct me if I am wrong.

It seems like the model returns an attention map for each generated token, so the shape of the attention tensor is consistent with the number of tokens in input_ids. After token generation finishes, input_ids has the following structure: input_ids = "endoftext the answer is answer because explanation", and only the part in current_output = "answer because explanation" is decoded and returned. Given the type of tokenizer used, complex/rare words are represented by multiple tokens (for example, ballroom = Ġball + room, skis = Ġsk + is). During decoding, several tokens may therefore be merged back into one word, so the number of words in the final sentence can differ from the number of tokens. Hence the difference in shape.
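A quick check with the tokenizer makes this visible (I am assuming the demo's GPT-2 byte-pair tokenizer from HuggingFace here):

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")  # assumption: the demo's tokenizer
print(tokenizer.tokenize(" ballroom"))  # ['Ġball', 'room'] -> two tokens, one word
print(tokenizer.tokenize(" skis"))      # ['Ġsk', 'is']     -> two tokens, one word
```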

Therefore, iterating over last_am = xa_maps[-1].mean(1)[0, question_len:] will give me the visual explanation for all tokens (which @fawazsammani kindly explained already), but we don't always have a 1-to-1 relationship between tokens and words, since more complex words are composed of multiple tokens (which was the part I missed).

Moreover, for the ACT_X visual explanation output, I think the attention maps of the "endoftext the answer is" part should be excluded, so that we start from the attention map of the answer, like in the VQA_X case. So maybe something like this (@fawazsammani what do you think? Is this correct?):

input_len = len(tokenizer.convert_ids_to_tokens(input_ids[0]))
last_am = xa_maps[-1].mean(1)[0, input_len:]

and then iterate over last_am to get the visual explanations for all tokens of the "answer because explanation" part.

In the case of complex words, how do we manage/interpret the visual explanation? Taking the "ballroom" example, I will have one visual map for "ball" and one for "room", which separately don't really have the same meaning as "ballroom". "Room" is closer to "ballroom", but "ball" doesn't quite fit the meaning here.

@fawazsammani What is your opinion on this?

Thank you. Sorry once again for the long post. Have a great day.

fawazsammani commented 2 years ago

Hi @Ellyuca Thank you for taking the time to write your question and to understand the code. I will read your two questions and reply to you tonight.

Regards

fawazsammani commented 2 years ago

Hi again @Ellyuca

Therefore, iterating over last_am = xa_maps[-1].mean(1)[0, question_len:] will give me the visual explanation for all tokens (which @fawazsammani kindly explained already), but we don't always have a 1-to-1 relationship between tokens and words, since more complex words are composed of multiple tokens (which was the part I missed).

Yes, that's correct. This will give you the cross-attention maps for the byte-pair-encoded tokens, which in turn come from the tokenized words.

In the case of complex words, how do we manage/interpret the visual explanation? Taking the "ballroom" example, I will have one visual map for "ball" and one for "room", which separately don't really have the same meaning as "ballroom". "Room" is closer to "ballroom", but "ball" doesn't quite fit the meaning here.

In that case, you can choose to average them together, or make your life simple and just pick one of them. They usually portray the same thing (and if not, they should be fairly similar). Taking your example: when the model generates "ball", it already knows that the next word to be generated is "room" (it is already trained), so the visual attention map will already correspond to "room", or in general to "ballroom". Maybe I'm wrong on this; I'm not sure whether this is discussed in the literature. Usually people just ignore it.
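For example, something along these lines could merge the subword maps back into word-level maps (just a sketch; it assumes last_am and the decoded BPE token strings are available, and that the 'Ġ' prefix marks the start of a new word in the GPT-2 tokenizer):

```python
import torch

def merge_subword_maps(last_am, tokens):
    """Average the cross-attention maps of BPE pieces that belong to the same word.
    last_am: tensor of shape (num_tokens, 196); tokens: e.g. ['Ġball', 'room', 'Ġbecause', ...]"""
    words, groups = [], []
    for tok, amap in zip(tokens, last_am):
        if tok.startswith("Ġ") or not words:   # a new word starts here
            words.append(tok.lstrip("Ġ"))
            groups.append([amap])
        else:                                   # continuation of the previous word
            words[-1] += tok
            groups[-1].append(amap)
    word_maps = torch.stack([torch.stack(g).mean(0) for g in groups])
    return words, word_maps
```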

Moreover, for the ACT_X visual explanation output, I think the attention maps of the "endoftext the answer is" part should be excluded, so that we start from the attention map of the answer, like in the VQA_X case. So maybe something like this (@fawazsammani what do you think? Is this correct?):

Yes! You are correct. This is a small mistake in the code that I did not pay attention to. Thanks a lot for pointing it out. But to verify point 2 above, you can see that the visual attention map for "the" (since we are taking the first index of xa_maps) already corresponds to the correct answer, even though the answer has not yet been generated. That confirms that the model already knows future words.

Finally, if you are looking into model explainability, you shouldn't really trust attention maps that much. In some cases they are very noisy and don't even correspond to "why" the model made a certain decision. Instead, you may look at explainable AI techniques such as Grad-CAM, Integrated Gradients, saliency methods, occlusion, LIME, etc. The captum library offers an efficient and easy way to run them.
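As a pointer, the general captum workflow looks roughly like this (a minimal sketch on a plain torchvision classifier, not on NLX-GPT; the model, input and target class index here are just placeholders):

```python
import torch
from captum.attr import IntegratedGradients
from torchvision import models

model = models.resnet18(pretrained=True).eval()
ig = IntegratedGradients(model)

image = torch.rand(1, 3, 224, 224)        # placeholder input image
baseline = torch.zeros_like(image)        # all-black reference image
# attribute the prediction for class index 207 to the input pixels
attributions = ig.attribute(image, baselines=baseline, target=207)
print(attributions.shape)                 # same shape as the input
```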

Please let me know if you have any other questions.

Regards Fawaz

Ellyuca commented 2 years ago

Hi @fawazsammani. Thank you for taking the time to answer my questions. It really helped clarify some of my doubts. Just a few final thoughts on this discussion.

Finally, if you are looking into model explainability, you shouldn't really trust attention maps that much. In some cases they are very noisy and don't even correspond to "why" the model made a certain decision.

Looking at some examples, the attention maps are indeed very noisy and sometimes don't make much sense. Basically, I wanted to see what the relationship is between the generated words and the most highlighted parts of the attention map, and whether this relationship is somehow aligned with human understanding/judgement.

For example, given the image 017829326.jpg (a woman playing the violin), for ACT_X we get the following output: "violin because she is sitting in a chair holding a violin in her hands". [attached image: original_violin]

It seems like the attention map at index 0 (endoftext) focuses on parts such as the face/neck and the violin, which makes sense for the predicted answer. The same goes for index 3 (is), where the attention is mostly on the violin. But when we move to index 4 (violin), the most highlighted areas are the woman's neck and a corner of the violin case. I would expect the attention map at index 4 (violin) to look more like the one at index 3, or even index 0. [attached image: attn_map_224x224x3_index_0]

Analyzing other examples as well, it seems that the first attention map (index 0) points to the parts of the image used for generating the entire sentence. It looks like a sort of "summary" of the output, as you also mentioned above:

you can see that the visual attention map for "the" (since we are taking the first index of xa_maps) already corresponds to the correct answer, even though the answer has not yet been generated.

Sometimes there is a match between the attention map and its corresponding token, in the sense that the attention map points to the area in the image where that exact object/attribute/etc. can be found. But for most of the examples I analyzed this is not the case. So I agree with you on this:

don't even correspond to "why" the model made a certain decision

Thank you once again for all your patience and time. Best wishes.

fawazsammani commented 2 years ago

@Ellyuca Yes, exactly. Attention maps are a very bad way of explaining model decisions. The reason they are still widely used is that they are model-intrinsic: they come naturally with the model, so people don't have to implement post-hoc techniques to explain it. But many papers actually show that attention maps are meaningless and should not be taken too seriously. Then again, they may work very well in some cases.

I would like to thank you for the time you spent understanding this code and every piece of it. You probably understand it more than I do now, and if I ever forget what I wrote I can always come back to you :). And thank you also for spotting that mistake.

If you have other questions (not only on this code, but anything related in general), please reach out to me via email: fawaz.sammani@vub.be

Have a great day!