rhjdzuDL opened 1 year ago
Thank you for the comments.
We are sorry for missing the averaging term $\frac{1}{N-1}$ for $\theta_{h,0}$ in the manuscript. $A_{i,j}^{l,h}$ can be viewed as the attention value in the $i$-th column of the $j$-th row. We apologize for this confusion.
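For concreteness, a plausible form of the corrected head-weight, assuming the convention where $A_{0,j}^{l,h}$ denotes the attention between the CLS query and token $j$ in head $h$ of layer $l$ (this reconstruction is ours, not a quote from the paper), would be:

$$\theta_{h,0} = \frac{1}{N-1}\sum_{j=1}^{N-1} A_{0,j}^{l,h}.$$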
The attention score in Eq. 8 combines the CLS attention and the other attention. The CLS token is the final output used for classification in ViT, so it is relatively more important than the other tokens. Thus, the CLS attention is separated out of the attention matrix as one term, and the average of the other attention forms the second term of the attention score. Considering that a head with denser and larger values is more important, the CLS attention and the other attention from different heads are combined via a weighted average using head-weights. For the CLS attention, a head in which the correlation between the CLS token and the other tokens is higher should be more important, so its head-weight is computed as $\theta_{h,0}$. For the other attention, a head with a large $A_{0,0}^{l,h}$, i.e., where the CLS token in Q is highly correlated with that in K, has a more robust CLS token and is therefore regarded as more important. Hence, $\theta_{h,0} = A_{0,0}^{l,h}$ is used as the head-weight for the other attention. A minimal sketch of this weighted combination is given below.
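For intuition, here is a minimal PyTorch sketch of how such a score could be assembled. It is a reconstruction from the description above, not the repository's actual implementation: the function name `attention_score`, the tensor layout `(B, H, N, N)` with the CLS token at index 0, and the exact per-head normalization are all assumptions.

```python
import torch

def attention_score(attn: torch.Tensor) -> torch.Tensor:
    # attn: attention matrix of one layer, shape (B, H, N, N),
    # with the CLS token at index 0 (assumed layout).
    B, H, N, _ = attn.shape

    # CLS attention: how strongly the CLS query attends to each other token.
    cls_attn = attn[:, :, 0, 1:]                 # (B, H, N-1)

    # Other attention: average attention the remaining queries pay to each token.
    other_attn = attn[:, :, 1:, 1:].mean(dim=2)  # (B, H, N-1)

    # Head-weight for the CLS term: averaged CLS-to-token correlation,
    # i.e. the theta_{h,0} term including the 1/(N-1) averaging.
    w_cls = cls_attn.mean(dim=-1, keepdim=True)  # (B, H, 1)

    # Head-weight for the other term: A_{0,0}^{l,h}, the CLS self-attention,
    # read as a proxy for how robust the head's CLS token is.
    w_other = attn[:, :, 0, 0].unsqueeze(-1)     # (B, H, 1)

    # Weighted average over heads of each term, then summed.
    cls_term = (w_cls * cls_attn).sum(dim=1) / w_cls.sum(dim=1)
    other_term = (w_other * other_attn).sum(dim=1) / w_other.sum(dim=1)
    return cls_term + other_term                 # (B, N-1): one score per non-CLS token
```

Broadcasting handles the `(B, H, 1)` weights against the `(B, H, N-1)` terms, so the per-head normalization is just a division by the summed weights.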
Great work! It is very interesting to handle ViT compression from the perspective of low-frequency components.
I feel a little confused about the attention score (Eq. 8). The definition looks obscure and lacks necessary explanation. I have read the code (and know what it does), but still cannot get the intuition. So, what exactly do these operations mean? https://github.com/Daner-Wang/VTC-LFC/blob/cfea7ee95a9a1f6aa2d2af512d312646eee75577/models/deit/deit.py#L41-L46
By the way, there seem to be two mistakes in the paper compared with the code: