Going Deeper with Image Transformers, CaiT, by Facebook AI and Sorbonne University, 2021 ICCV, Over 100 Citations, Image Classification, Transformer, Vision Transformer, ViT
From (a) ViT, to (d) ViT Using Proposed LayerScale.
(a) Vision Transformer (ViT): instantiates a particular form of residual architecture. After casting the input image into a set of vectors $x_{0}$, the network alternates self-attention layers (SA) with feed-forward networks (FFN), as: $x'_{l} = x_{l} + \mathrm{SA}(\eta(x_{l}))$, $x_{l+1} = x'_{l} + \mathrm{FFN}(\eta(x'_{l}))$, where $\eta$ is the layer normalization.
(b) Fixup [75], ReZero [2] and SkipInit [16]: introduce a learnable scalar weighting $\alpha_{l}$ on the output of residual blocks, while removing the pre-normalization and the warmup: $x'_{l} = x_{l} + \alpha_{l}\,\mathrm{SA}(x_{l})$, $x_{l+1} = x'_{l} + \alpha'_{l}\,\mathrm{FFN}(x'_{l})$. The empirical observation in this paper is that removing the warmup and the layer normalization is what makes training unstable in Fixup and T-Fixup.
(c) Both Layer Norm and Learnable Scalar Weighting: re-introduces the pre-normalization $\eta$ and the warmup together with the learnable scalar $\alpha_{l}$: $x'_{l} = x_{l} + \alpha_{l}\,\mathrm{SA}(\eta(x_{l}))$, $x_{l+1} = x'_{l} + \alpha'_{l}\,\mathrm{FFN}(\eta(x'_{l}))$. When $\alpha_{l}$ is initialized at a small value, this choice helps convergence when increasing the depth.
(d) LayerScale: is a per-channel multiplication of the vector produced by each residual block, as opposed to a single scalar. The objective is to group the updates of the weights associated with the same output channel. Formally, LayerScale is a multiplication by a diagonal matrix on the output of each residual block: $x'_{l} = x_{l} + \mathrm{diag}(\lambda_{l,1}, \dots, \lambda_{l,d}) \times \mathrm{SA}(\eta(x_{l}))$, $x_{l+1} = x'_{l} + \mathrm{diag}(\lambda'_{l,1}, \dots, \lambda'_{l,d}) \times \mathrm{FFN}(\eta(x'_{l}))$, where the parameters $\lambda_{l,i}$ and $\lambda'_{l,i}$ are learnable weights.
LayerScale offers more diversity in the optimization than just adjusting the whole layer by a single learnable scalar.
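As a minimal sketch of such a block (assuming a PyTorch implementation; the module names, the use of `nn.MultiheadAttention` for SA, and the `1e-5` init value are illustrative assumptions, not the official code):

```python
import torch
import torch.nn as nn

class LayerScaleBlock(nn.Module):
    """Pre-norm transformer block with per-channel LayerScale on both residual branches."""
    def __init__(self, dim, num_heads, mlp_ratio=4.0, init_eps=1e-5):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                                        # eta in (a)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)   # SA
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.ffn = nn.Sequential(                                             # FFN
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        # LayerScale: one learnable scale per channel, i.e. the diagonal of
        # diag(lambda_{l,1}, ..., lambda_{l,d}) in (d), initialized to a small value.
        self.gamma1 = nn.Parameter(init_eps * torch.ones(dim))
        self.gamma2 = nn.Parameter(init_eps * torch.ones(dim))

    def forward(self, x):                                                     # x: (B, N, d)
        y = self.norm1(x)
        x = x + self.gamma1 * self.attn(y, y, y, need_weights=False)[0]
        x = x + self.gamma2 * self.ffn(self.norm2(x))
        return x
```

Because `gamma1`/`gamma2` start near zero, each residual branch contributes almost nothing at the beginning of training, which is what eases optimization at large depth.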
CLS Token Places and Interactions. In CaiT, the self-attention stage processes the patch embeddings only; the class (CLS) token is inserted late and interacts with the patches through dedicated class-attention (CA) layers, in which only the class embedding is updated.
Improving convergence at depth on ImageNet-1k.
LayerScale outperforms other weighting variants and baselines.
Variations on CLS with DeiT-Small (no LayerScale).
Inserting the CLS token late obtains better results.
With the class-attention stage, a further improvement is observed.
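A hedged sketch of a class-attention layer consistent with that description (assuming PyTorch; the module layout and names are my own, not the official implementation): the query is computed from the CLS token only, keys and values from the concatenation of the CLS and patch tokens, so only the CLS embedding is updated.

```python
import torch
import torch.nn as nn

class ClassAttention(nn.Module):
    """Class-attention: CLS token attends to [CLS; patches]; patches are left unchanged."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)    # projects the CLS token only
        self.k = nn.Linear(dim, dim)    # projects [CLS; patch tokens]
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_cls, x_patches):
        # x_cls: (B, 1, d), x_patches: (B, N, d)
        B, N, d = x_patches.shape
        h = self.num_heads
        z = torch.cat([x_cls, x_patches], dim=1)                          # (B, N+1, d)
        q = self.q(x_cls).reshape(B, 1, h, d // h).transpose(1, 2)        # (B, h, 1, d/h)
        k = self.k(z).reshape(B, N + 1, h, d // h).transpose(1, 2)        # (B, h, N+1, d/h)
        v = self.v(z).reshape(B, N + 1, h, d // h).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale                     # (B, h, 1, N+1)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, 1, d)                 # updated CLS only
        return self.proj(out)
```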
CaiT Model Variants.
CaiT model variants are constructed from XXS-24 up to M-36, where the suffix number is the network depth.
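For reference, a hypothetical mapping from variant names to hyperparameters (depth, embedding dimension, attention heads); the values below reflect my reading of the paper and common re-implementations, so treat them as assumptions rather than the official table.

```python
# Assumed CaiT variant hyperparameters; check against the paper's model table.
CAIT_VARIANTS = {
    "XXS-24": dict(depth=24, embed_dim=192, num_heads=4),
    "XXS-36": dict(depth=36, embed_dim=192, num_heads=4),
    "XS-24":  dict(depth=24, embed_dim=288, num_heads=6),
    "XS-36":  dict(depth=36, embed_dim=288, num_heads=6),
    "S-24":   dict(depth=24, embed_dim=384, num_heads=8),
    "S-36":   dict(depth=36, embed_dim=384, num_heads=8),
    "M-36":   dict(depth=36, embed_dim=768, num_heads=16),
}
```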
SOTA Comparison.
CaiT can go deeper while achieving better performance.
CaiT obtains higher accuracy than the compared models.
Results in transfer learning.
CaiT obtains better performance when fine-tuned on downstream tasks.
Ablation path from DeiT-S to the CaiT models.
Besides the CaiT-specific techniques (LayerScale and class-attention), techniques from other papers, such as the distillation introduced in DeiT, are also used.
Illustration of the regions of focus of a CaiT-XXS model, according to the response of the first class-attention layer (only some examples are shown here).
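A hedged sketch of how such a map could be extracted, assuming attention weights shaped like the `ClassAttention` sketch above, i.e. (B, heads, 1, N+1) with the CLS key at index 0; the function name and patch-grid handling are illustrative.

```python
import torch
import torch.nn.functional as F

def class_attention_map(attn: torch.Tensor, image_size: int, patch_size: int) -> torch.Tensor:
    """Turn CLS->patch attention weights of a class-attention layer into a saliency map."""
    cls_to_patches = attn[:, :, 0, 1:]        # drop the CLS->CLS entry: (B, heads, N)
    saliency = cls_to_patches.mean(dim=1)     # average over heads: (B, N)
    grid = image_size // patch_size           # assumes a square patch grid
    saliency = saliency.reshape(-1, 1, grid, grid)
    # upsample to the input resolution for overlaying on the image
    return F.interpolate(saliency, size=(image_size, image_size),
                         mode="bilinear", align_corners=False)
```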
Sik-Ho Tsang. Review — CaiT: Going Deeper with Image Transformers.