Going Deeper with Image Transformers, CaiT, by Facebook AI and Sorbonne University, 2021 ICCV, Over 100 Citations, Image Classification, Transformer, Vision Transformer, ViT
From (a) ViT, to (d) ViT Using Proposed LayerScale.
(a) Vision Transformer (ViT): instantiates a particular form of residual architecture. After casting the input image into a set of vectors $x_{0}$, the network alternates self-attention layers (SA) with feed-forward networks (FFN), as: $x'_{l} = x_{l} + \mathrm{SA}(\eta(x_{l}))$, $x_{l+1} = x'_{l} + \mathrm{FFN}(\eta(x'_{l}))$, where $\eta$ is the layer normalization.
(b) Fixup [75], ReZero [2] and SkipInit [16]: introduce a learnable scalar weighting $\alpha_{l}$ on the output of residual blocks, while removing the pre-normalization and the warmup: $x'_{l} = x_{l} + \alpha_{l}\,\mathrm{SA}(x_{l})$, $x_{l+1} = x'_{l} + \alpha'_{l}\,\mathrm{FFN}(x'_{l})$. The empirical observation in this paper is that removing the warmup and the layer normalization is what makes training unstable in Fixup and T-Fixup.
(c) Both Layer Norm and Learnable Scalar Weighting: re-introduces the pre-normalization $\eta$ and the warmup together with the learnable scalar $\alpha_{l}$: $x'_{l} = x_{l} + \alpha_{l}\,\mathrm{SA}(\eta(x_{l}))$, $x_{l+1} = x'_{l} + \alpha'_{l}\,\mathrm{FFN}(\eta(x'_{l}))$. When $\alpha_{l}$ is initialized at a small value, this choice helps convergence when increasing the depth.
(d) LayerScale: is a per-channel multiplication of the vector produced by each residual block, as opposed to a single scalar. The objective is to group the updates of the weights associated with the same output channel. Formally, LayerScale is a multiplication by a diagonal matrix on the output of each residual block: $x'_{l} = x_{l} + \mathrm{diag}(\lambda_{l,1}, \dots, \lambda_{l,d}) \times \mathrm{SA}(\eta(x_{l}))$, $x_{l+1} = x'_{l} + \mathrm{diag}(\lambda'_{l,1}, \dots, \lambda'_{l,d}) \times \mathrm{FFN}(\eta(x'_{l}))$, where the parameters $\lambda_{l,i}$ and $\lambda'_{l,i}$ are learnable weights.
LayerScale offers more diversity in the optimization than just adjusting the whole layer by a single learnable scalar.
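As a minimal sketch of such a block (assuming a PyTorch implementation; the module names, the use of `nn.MultiheadAttention` for SA, and the `1e-5` init value are illustrative assumptions, not the official code):

```python
import torch
import torch.nn as nn

class LayerScaleBlock(nn.Module):
    """Pre-norm transformer block with per-channel LayerScale on both residual branches."""
    def __init__(self, dim, num_heads, mlp_ratio=4.0, init_eps=1e-5):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                                        # eta in (a)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)   # SA
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.ffn = nn.Sequential(                                             # FFN
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        # LayerScale: one learnable scale per channel, i.e. the diagonal of
        # diag(lambda_{l,1}, ..., lambda_{l,d}) in (d), initialized to a small value.
        self.gamma1 = nn.Parameter(init_eps * torch.ones(dim))
        self.gamma2 = nn.Parameter(init_eps * torch.ones(dim))

    def forward(self, x):                                                     # x: (B, N, d)
        y = self.norm1(x)
        x = x + self.gamma1 * self.attn(y, y, y, need_weights=False)[0]
        x = x + self.gamma2 * self.ffn(self.norm2(x))
        return x
```

Because `gamma1`/`gamma2` start near zero, each residual branch contributes almost nothing at the beginning of training, which is what eases optimization at large depth.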
CLS Token Places and Interactions. In CaiT, the self-attention stage processes the patch embeddings only; the class (CLS) token is inserted late and interacts with the patches through dedicated class-attention (CA) layers, in which only the class embedding is updated.
Improving convergence at depth on ImageNet-1k.
LayerScale outperforms other weighting variants and baselines.
Variations on CLS with DeiT-Small (no LayerScale).
Inserting the CLS token late obtains better results.
With the class-attention stage, a further improvement is observed.
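A hedged sketch of a class-attention layer consistent with that description (assuming PyTorch; the module layout and names are my own, not the official implementation): the query is computed from the CLS token only, keys and values from the concatenation of the CLS and patch tokens, so only the CLS embedding is updated.

```python
import torch
import torch.nn as nn

class ClassAttention(nn.Module):
    """Class-attention: CLS token attends to [CLS; patches]; patches are left unchanged."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)    # projects the CLS token only
        self.k = nn.Linear(dim, dim)    # projects [CLS; patch tokens]
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_cls, x_patches):
        # x_cls: (B, 1, d), x_patches: (B, N, d)
        B, N, d = x_patches.shape
        h = self.num_heads
        z = torch.cat([x_cls, x_patches], dim=1)                          # (B, N+1, d)
        q = self.q(x_cls).reshape(B, 1, h, d // h).transpose(1, 2)        # (B, h, 1, d/h)
        k = self.k(z).reshape(B, N + 1, h, d // h).transpose(1, 2)        # (B, h, N+1, d/h)
        v = self.v(z).reshape(B, N + 1, h, d // h).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale                     # (B, h, 1, N+1)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, 1, d)                 # updated CLS only
        return self.proj(out)
```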
CaiT Model Variants.
CaiT model variants are constructed from XXS-24 up to M-36, where the suffix number is the network depth.
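For reference, a hypothetical mapping from variant names to hyperparameters (depth, embedding dimension, attention heads); the values below reflect my reading of the paper and common re-implementations, so treat them as assumptions rather than the official table.

```python
# Assumed CaiT variant hyperparameters; check against the paper's model table.
CAIT_VARIANTS = {
    "XXS-24": dict(depth=24, embed_dim=192, num_heads=4),
    "XXS-36": dict(depth=36, embed_dim=192, num_heads=4),
    "XS-24":  dict(depth=24, embed_dim=288, num_heads=6),
    "XS-36":  dict(depth=36, embed_dim=288, num_heads=6),
    "S-24":   dict(depth=24, embed_dim=384, num_heads=8),
    "S-36":   dict(depth=36, embed_dim=384, num_heads=8),
    "M-36":   dict(depth=36, embed_dim=768, num_heads=16),
}
```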
SOTA Comparison.
CaiT can go deeper while achieving better performance.
CaiT obtains higher accuracy than the compared models.
Results in transfer learning.
CaiT obtains better performance when fine-tuned on downstream tasks.
Ablation path from DeiT-S to the CaiT models.
Besides the CaiT-specific techniques (LayerScale and class-attention), techniques from other papers, such as the distillation introduced in DeiT, are also used.
Illustration of the regions of focus of a CaiT-XXS model, according to the response of the first class-attention layer (only some examples are shown here).
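A hedged sketch of how such a map could be extracted, assuming attention weights shaped like the `ClassAttention` sketch above, i.e. (B, heads, 1, N+1) with the CLS key at index 0; the function name and patch-grid handling are illustrative.

```python
import torch
import torch.nn.functional as F

def class_attention_map(attn: torch.Tensor, image_size: int, patch_size: int) -> torch.Tensor:
    """Turn CLS->patch attention weights of a class-attention layer into a saliency map."""
    cls_to_patches = attn[:, :, 0, 1:]        # drop the CLS->CLS entry: (B, heads, N)
    saliency = cls_to_patches.mean(dim=1)     # average over heads: (B, N)
    grid = image_size // patch_size           # assumes a square patch grid
    saliency = saliency.reshape(-1, 1, grid, grid)
    # upsample to the input resolution for overlaying on the image
    return F.interpolate(saliency, size=(image_size, image_size),
                         mode="bilinear", align_corners=False)
```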
Sik-Ho Tsang. Review — CaiT: Going Deeper with Image Transformers.