
# DeepViT: Towards Deeper Vision Transformer (ICCV 2021) #42

XFeiF opened this issue 3 years ago

XFeiF commented 3 years ago

Paper
[Code] Not available yet.

Authors: Daquan Zhou, Bingyi Kang, et al.


XFeiF commented 3 years ago

Ten Questions


1. What is the problem addressed in the paper?

This paper aims to explore deeper vision transformer models. The authors find that as the transformer goes deeper, the attention maps gradually become similar and even nearly identical after certain layers, a phenomenon they name attention collapse. They develop a simple yet effective method, named Re-attention, which regenerates the attention maps computed by the multi-head self-attention (MHSA) module to increase their diversity across layers, with negligible computation and memory cost.

Concretely, they use the attention maps from the different heads as a basis and generate a new set of attention maps by dynamically aggregating them. The core equation is

$$\mathrm{Re\text{-}Attention}(Q, K, V) = \mathrm{Norm}\!\left(\Theta^{\top}\,\mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)\right)V,$$

where the learnable transformation matrix $\Theta \in \mathbb{R}^{H \times H}$ is multiplied with the self-attention map $A$ along the head dimension.
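Below is a minimal PyTorch sketch of how the Re-Attention equation above could be implemented. The 1x1 convolution used for Θ and the BatchNorm across the head dimension are my assumptions about how the transformation and Norm are realized, not the authors' released code:

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Sketch of the Re-Attention idea: mix the H attention maps with a
    learnable H x H matrix (here a 1x1 conv over the head dimension),
    normalize, then apply the mixed maps to V. Simplified and assumed."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable transformation Theta applied along the head dimension.
        self.theta = nn.Conv2d(num_heads, num_heads, kernel_size=1, bias=False)
        # Normalization over the head dimension (assumed to be a BatchNorm).
        self.norm = nn.BatchNorm2d(num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, H, N, N)
        attn = attn.softmax(dim=-1)
        # Re-Attention: aggregate the H attention maps with Theta, then norm.
        attn = self.norm(self.theta(attn))              # (B, H, N, N)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

The intended usage, per the paper, is simply to replace the standard MHSA module inside each transformer block with this Re-Attention variant.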

2. Is this a new problem?

Exploring deeper models is not a new problem in computer vision; residual learning addressed a similar difficulty for deep CNNs.
It is, however, a 'new' problem for vision transformers. I think it matters for two reasons. First, it is among the first works on making vision transformers deeper. Second, it tackles the problem with the proposed simple yet effective Re-attention module. I expect this to be a good starting point for exploring deeper yet effective transformers.

3. What is the scientific hypothesis that the paper is trying to verify?

Q: Why does directly scaling the depth of ViT by stacking more transformer blocks fail to improve performance monotonically?
A: Attention collapse. They find that the attention maps, which aggregate the features in each transformer block, tend to become overly similar after certain layers (a simple way to measure this is sketched after this block).
Q: How do they solve this problem?
A: Re-Attention. Re-attention takes advantage of the MHSA structure and regenerates the attention maps by exchanging information across attention heads in a learnable manner.
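To make "overly similar" concrete, here is a small sketch that measures per-token cosine similarity between the attention maps of two layers. The paper reports a closely related cross-layer similarity ratio, so this is an approximation of the idea rather than the authors' exact metric:

```python
import torch
import torch.nn.functional as F

def cross_layer_similarity(attn_p, attn_q):
    """Mean cosine similarity between attention maps of two layers.

    attn_p, attn_q: (B, H, N, N) attention maps from layers p and q.
    For each head and each query token, compare the attention
    distributions over the N keys; average the result.
    """
    p = attn_p.flatten(0, 1)                    # (B*H, N, N)
    q = attn_q.flatten(0, 1)
    sim = F.cosine_similarity(p, q, dim=-1)     # (B*H, N)
    return sim.mean().item()
```

Applied to pairs of blocks in a deep ViT, this similarity should approach 1 in later layers if attention collapse occurs.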

4. What are the key related works and who are the key people working on this topic?

ViT, ResNet

5. What is the key to the proposed solution in the paper?

Transformer-like models have modularized architectures and can thus easily be made deeper by repeating the basic transformer block or by enlarging the embedding dimension. However, those strategies only work well with larger datasets and stronger augmentation policies to alleviate the resulting training difficulties.

(Larger datasets? Stronger augmentation policies? That does not really pin down the key difference between Re-Attention and other works; this paper is probably among the very first to address the issue.)
This paper proposes the Re-attention module to address these difficulties in scaling the depth of vision transformers.
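As background, the "repeat the basic block" form of depth scaling is trivial to express. A toy sketch using PyTorch's generic nn.TransformerEncoderLayer as a stand-in block (not DeepViT's actual block, and omitting patch embedding, class token, and positional embeddings):

```python
import torch
import torch.nn as nn

# Depth scaling in its simplest form: repeat the same block `depth` times.
depth, dim, heads = 32, 384, 12
blocks = nn.Sequential(*[
    nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                               dim_feedforward=4 * dim,
                               batch_first=True, norm_first=True)
    for _ in range(depth)
])

tokens = torch.randn(2, 197, dim)   # (batch, num_patches + cls, dim), toy input
out = blocks(tokens)                # same shape as the input
print(out.shape)
```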

6. How are the experiments designed?

  1. Pilot study on ImageNet: investigate how the performance of ViT changes as the model depth increases.
    As the number of transformer blocks increases, the model performance does not improve accordingly.
  2. Self-Attention: investigate how the generated attention map A varies as the model goes deeper, since the self-attention mechanism plays a key role in ViTs.
    Attention collapse.
    When the embedding dimension of each token is increased (from 256 to 768), the number of blocks with similar attention maps drops and attention collapse is alleviated, but the computation cost grows significantly, which is not practical.
  3. Re-Attention vs. Self-Attention.
  4. Comparison to adding a temperature in self-attention (see the sketch after this list).
  5. Comparison to dropping attentions.
  6. Comparison with other SOTA models.
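For reference, a temperature-scaled self-attention baseline might look like the sketch below; the exact parameterization in the paper (e.g., whether the temperature is learnable, fixed, or per-layer) is an assumption here:

```python
import torch

def attention_with_temperature(q, k, v, tau=1.0):
    """Scaled dot-product attention with an extra temperature tau on the
    logits. tau = 1 recovers standard attention; tau > 1 flattens the
    attention distribution. q, k, v: (B, H, N, d) tensors (shapes assumed).
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / (tau * d ** 0.5)
    return logits.softmax(dim=-1) @ v
```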

7. What datasets are built/used for the quantitative evaluation? Is the code open-sourced?

ImageNet. The code will be released after the supplementary material deadline.

8. Is the scientific hypothesis well supported by evidence in the experiments?

I think yes. See questions 3 and 6.

9. What are the contributions of the paper?

  1. Attention collapse issue.
  2. Re-attention module.
  3. The first successfully trained 32-block ViT on ImageNet-1k and new SOTA.

10. What should/could be done next?

  1. Is the Re-Attention module the best way of exchanging the information from different attention heads in a learnable manner?
  2. The Re-Attention mechanism still collapses after 32 blocks. Any new design to make it possible to stack transformer blocks without limitations?