ChangwenXu98 / TransPolymer

Implementation of "TransPolymer: a Transformer-based language model for polymer property predictions" in PyTorch
MIT License

Finetuning attention maps #10

himanshisyadav closed this issue 12 months ago

himanshisyadav commented 12 months ago

I was wondering how to determine which tokens have a higher attention score than the others. In short, how do you arrive at the red highlights in Figure 6 of the paper? How do you aggregate the attention scores from all 12 attention heads to reach this conclusion?

ChangwenXu98 commented 12 months ago

Hi @himanshisyadav, Fig. 6 shows the attention scores between the first special token and the other tokens in the sequence. The attention scores can be obtained from the transformer encoder; you can refer to Attention_vis.py for the implementation. The attention scores are not aggregated: the scores from each head are presented in the corresponding subfigure of Fig. 6.
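For readers who want to reproduce this, a minimal sketch of pulling the per-head attention of the first special token from a HuggingFace Roberta-style encoder is below. It is not the repo's Attention_vis.py; the checkpoint name and input sequence are placeholders, and in practice the pretrained TransPolymer weights and the repo's PolymerSmilesTokenizer would be used instead.

```python
import torch
from transformers import RobertaModel, RobertaTokenizerFast

# "roberta-base" is a stand-in checkpoint; substitute the pretrained
# TransPolymer weights and the repo's PolymerSmilesTokenizer in practice.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
model.eval()

sequence = "CC(C)C(=O)O"  # placeholder input, not a sequence from the paper
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    # output_attentions=True returns one tensor per layer,
    # each of shape (batch, num_heads, seq_len, seq_len)
    outputs = model(**inputs, output_attentions=True)

last_layer = outputs.attentions[-1]     # (1, num_heads, seq_len, seq_len)
cls_attention = last_layer[0, :, 0, :]  # attention from the first special token
                                        # to every token, one row per head
print(cls_attention.shape)              # (num_heads, seq_len)
```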

himanshisyadav commented 12 months ago

Hi @ChangwenXu98, I guess I didn't pose my question very well. I wanted to know how you identify which token in the sequence is attended to the most. For example, the red highlights in the text at the bottom indicate that '$F' and '-23' have the highest attention scores. I have been able to produce the attention maps, but I am not sure how to produce the text part with the highlighted tokens in Figure 6 once I have the 12 attention maps. Can you point me to the line in Attention_vis.py that does that? I don't think there is such a line.

ChangwenXu98 commented 12 months ago

Hi @himanshisyadav, thanks for clarifying your question. Since you already know the sequence, you can read off from each subfigure the indices with high attention scores and map those indices back to their tokens. The first special token will not attend to exactly the same tokens across different attention heads and hidden layers, but in the example in Fig. 6 we can see that some tokens are attended to repeatedly across heads. This is interesting and may suggest that the transformer is well trained to attend to the important tokens in the sequence.
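One way to map the high-attention indices back to token strings, reusing the `tokenizer`, `inputs`, and `cls_attention` variables from the sketch above, is shown below; the choice of `top_k = 3` is arbitrary and only for illustration.

```python
# Map input ids back to token strings.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

# For each head, list the tokens that receive the highest attention
# from the first special token.
top_k = 3
for head, scores in enumerate(cls_attention):
    top_idx = torch.topk(scores, k=top_k).indices.tolist()
    print(f"head {head}: indices {top_idx} -> tokens {[tokens[i] for i in top_idx]}")
```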

himanshisyadav commented 12 months ago

Hi @ChangwenXu98, got it! So it's more of a qualitative analysis. I was trying to get the max from each attention head and then the max over all of the heads' maxes (a rough sketch of what I mean is below). Thank you for your clarification! Maybe I should look a little more into the literature to see if others have a more quantitative strategy.

Thank you again!
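For concreteness, the max-then-max idea mentioned above, plus a simple head-averaging alternative, could look roughly like the following, again reusing `cls_attention` and `tokens` from the earlier sketches; neither aggregation is prescribed by the paper or the repo.

```python
# (a) Max score per head, then pick the head with the single strongest score.
per_head_max, per_head_argmax = cls_attention.max(dim=-1)  # both (num_heads,)
best_head = per_head_max.argmax().item()
best_token = tokens[per_head_argmax[best_head].item()]
print("strongest single score: head", best_head, "-> token", best_token)

# (b) Average the first-token attention over heads to get one score per token.
mean_scores = cls_attention.mean(dim=0)                    # (seq_len,)
top_mean = torch.topk(mean_scores, k=3).indices.tolist()
print("top tokens by head-averaged score:", [tokens[i] for i in top_mean])
```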

himanshisyadav commented 12 months ago

Thank you! I'll close the issue!