Interpretation of attention score

Thank you for the comments.

We are sorry for missing the averaging term $\frac{1}{N-1}$ for $\theta{h,0}$ in the manuscript. $A{i,j}^{l,h}$ can be viewed as the attention value in the $i$-th column of the $j$-th row. We are sorry for this confusion.

The attention score in Eq. 8 combined with the CLS attention and the other attention. The CLS token is the final output used for classification in ViT, which is relatively more important than the other tokens. Thus, the CLS attention is separated from the attention matrix as a term and the average of the other attention is another term in the attention score. Considering that the head with denser and larger values is more important, the CLS attention and the other attention in different heads are weighted averaged by head-weights. For CLS attention, the head in which the correlation between CLS token and other tokens is higher should be more important, so the head-weight for CLS attention is computed as $\theta{h,0}$. For other tokens, the head with a large $A{0,0}^{l,h}$, i.e. the CLS token in Q is high correlated to that in K, has a more robust CLS token and is thus regarded to be more important. Hence, $\theta{h,0}$ = $A{0,0}^{l,h}$ is the head-weight for the other attention.

Daner-Wang / VTC-LFC

Interpretation of attention score #1