This paper aims to explore deeper vision transformer models. The authors find that as the transformer goes deeper, the attention maps gradually become similar and even nearly identical after certain layers, a phenomenon they name attention collapse. They propose a simple yet effective method, named Re-attention, which regenerates the attention maps computed by the multi-head self-attention (MHSA) module to increase their diversity across layers, with negligible computation and memory cost.
Concretely, they use the attention maps from the different heads as a basis and generate a new set of attention maps by dynamically aggregating them. The core equation is

$$\mathrm{Re\text{-}Attention}(Q, K, V) = \mathrm{Norm}\!\left(\Theta^{\top}\,\mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)\right) V,$$

where the learnable transformation matrix $\Theta \in \mathbb{R}^{H \times H}$ is multiplied with the self-attention map $A = \mathrm{Softmax}(QK^{\top}/\sqrt{d})$ along the head dimension ($H$ is the number of heads).
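Below is a minimal PyTorch sketch of this idea, reconstructed from the description above rather than taken from the authors' released code; the class name `ReAttention`, the near-identity initialization of Θ, and the use of BatchNorm for the Norm step are my assumptions.

```python
# Minimal sketch of Re-attention: mix the per-head attention maps with a
# learnable H x H matrix Theta before applying them to the values.
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Theta: learnable H x H matrix mixing attention maps across heads
        # (initialized near the identity -- my choice, not from the paper).
        self.theta = nn.Parameter(
            torch.eye(num_heads) + 0.01 * torch.randn(num_heads, num_heads)
        )
        # Normalization after mixing; BatchNorm over the head axis is one option.
        self.norm = nn.BatchNorm2d(num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        H = self.num_heads
        qkv = self.qkv(x).reshape(B, N, 3, H, C // H).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]          # each: (B, H, N, C // H)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)               # standard per-head maps A

        # Re-attention: aggregate the H maps with Theta along the head dim.
        attn = torch.einsum('hg,bgij->bhij', self.theta, attn)
        attn = self.norm(attn)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Starting Θ close to the identity keeps the module behaving like standard MHSA early in training, so the head mixing only has to learn a small perturbation.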
Making models deeper is not a new problem in the computer vision field; similar ideas were explored for deeper CNNs, e.g., residual learning.
But it is a 'new' problem for learning deeper vision transformer models. I think it is important for two reasons. First, it is the first work on making vision transformers deeper. Second, it tries to solve this problem with the proposed simple yet effective Re-attention module. I think this will be a good starting point for exploring deeper yet effective transformers.
Q: Why does directly scaling the depth of ViT by stacking more transformer blocks not monotonically improve performance?
A: Attention collapse. They find that the attention maps, used to aggregate features in each transformer block, tend to become overly similar after certain layers.
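One rough way to check for this collapse is to collect the per-layer attention maps (e.g., with forward hooks) and compare them across layers. The helper below computes a simple mean cosine similarity between layers; it is in the spirit of the cross-layer similarity the paper reports, though the paper's exact metric is defined per token and head.

```python
# attn_maps: list of per-layer attention tensors, each of shape (B, H, N, N),
# assumed to have been collected from a ViT with forward hooks.
import torch
import torch.nn.functional as F


def cross_layer_similarity(attn_maps):
    """Return an (L, L) matrix of mean cosine similarities between layers."""
    L = len(attn_maps)
    sim = torch.zeros(L, L)
    for p in range(L):
        for q in range(L):
            a = attn_maps[p].flatten(start_dim=2)   # (B, H, N * N)
            b = attn_maps[q].flatten(start_dim=2)
            sim[p, q] = F.cosine_similarity(a, b, dim=-1).mean()
    return sim
```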
Q: How to solve this problem?
A: Re-attention. Re-attention takes advantage of the MHSA structure and regenerates the attention maps by exchanging information among the different attention heads in a learnable manner.
ViT, ResNet
Transformer-like models have modularized architectures and thus can easily be made deeper by repeating the basic transformer blocks or using larger embedding dimensions. However, those strategies only work well with larger datasets and stronger augmentation policies that alleviate the resulting training difficulties.
(Larger datasets? Stronger augmentation policies? That is not exactly the key difference between Re-attention and other works. Maybe this paper is among the very first papers related to this issue.)
This paper proposes the Re-attention module to address the difficulties in scaling vision transformers.
ImageNet. Code will be released after the supplementary material deadline.
I think yes. See questions 3 and 6.
Paper
[Code] Not available yet ~
Authors: Daquan Zhou, Bingyi Kang, et al.