
Sik-Ho Tang | Review -- BEiT: BERT Pre-Training of Image Transformers. #137


NorbertZheng commented 7 months ago

Sik-Ho Tang. Review — BEiT: BERT Pre-Training of Image Transformers.

NorbertZheng commented 7 months ago

Overview

BEiT: BERT Pre-Training of Image Transformers, BEiT, by Microsoft Research, 2022 ICLR, Over 300 Citations.

NorbertZheng commented 7 months ago

BEiT Architecture

Overall illustration of BEiT pre-training (the first figure).

Overall Approach

During pre-training, a proportion of image patches is randomly masked, and the corrupted input is fed to the Transformer backbone. The model learns to recover the visual tokens of the original image, rather than the raw pixels of the masked patches.

NorbertZheng commented 7 months ago

Image Representation

The images have two views of representations, namely, image patch, and visual tokens. The two types serve as input and output representations during pre-training, respectively.

Image Patches

Image Patches (cut from the first figure).

Particularly, BEiT splits each $224\times 224$ image into a $14\times 14$ grid of image patches, where each patch is $16\times 16$.
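
As a concrete illustration, here is a minimal patch-splitting sketch in PyTorch; the framework choice and the `patchify` helper name are assumptions for illustration, not code from the paper:

```python
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split a batch of images (B, C, H, W) into flattened patches (B, N, P*P*C).

    For 224x224 inputs and P=16 this yields N = 14*14 = 196 patches per image.
    """
    B, C, H, W = images.shape
    P = patch_size
    assert H % P == 0 and W % P == 0
    # (B, C, H/P, P, W/P, P) -> (B, H/P, W/P, C, P, P) -> (B, N, P*P*C)
    x = images.reshape(B, C, H // P, P, W // P, P)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(B, (H // P) * (W // P), C * P * P)
    return x

# Example: a 224x224 RGB image becomes a 14x14 grid of 16x16 patches.
patches = patchify(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768])
```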

Visual Tokens

Visual Tokens (cut from the first figure).

Specifically, an image of size $H\times W\times C$ is tokenized into $z=[z_{1},\dots,z_{N}]$, where $N$ matches the number of image patches and the vocabulary $V=\{1,\dots,|V|\}$ contains discrete token indices.
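
Below is a toy sketch of what such a tokenizer looks like, assuming a simple convolutional encoder with a nearest-neighbour codebook lookup. BEiT itself reuses the publicly released DALL-E dVAE tokenizer, so the `ToyTokenizer` class and its dimensions are purely illustrative:

```python
import torch
import torch.nn as nn

class ToyTokenizer(nn.Module):
    """Illustrative stand-in for an image tokenizer: maps an image to a
    14x14 grid of discrete token ids drawn from a vocabulary of size |V|.
    (BEiT reuses the DALL-E dVAE tokenizer; this toy module only mimics its interface.)"""
    def __init__(self, vocab_size: int = 8192, patch_size: int = 16, dim: int = 64):
        super().__init__()
        # Convolutional encoder that downsamples by the patch size.
        self.encoder = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Codebook of |V| code vectors; the nearest code gives the token id z_i.
        self.codebook = nn.Embedding(vocab_size, dim)

    @torch.no_grad()
    def tokenize(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(images)                               # (B, dim, H/P, W/P)
        B, D, gh, gw = feats.shape
        feats = feats.flatten(2).transpose(1, 2).reshape(B * gh * gw, D)
        # Nearest-neighbour lookup in the codebook -> discrete indices in V.
        dists = torch.cdist(feats, self.codebook.weight)           # (B*N, |V|)
        return dists.argmin(dim=-1).reshape(B, gh * gw)            # (B, N) token ids

tokenizer = ToyTokenizer()
z = tokenizer.tokenize(torch.randn(2, 3, 224, 224))
print(z.shape)  # torch.Size([2, 196]), values in {0, ..., 8191}
```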

NorbertZheng commented 7 months ago

ViT Backbone

A standard Transformer (ViT) is used as the backbone. The patch embeddings, together with a special token $[S]$ and 1D positional embeddings, are fed into an $L$-layer Transformer, and the final hidden vectors $H^{L}=[h_{[S]}^{L},h_{1}^{L},\dots,h_{N}^{L}]$ are used as the encoded representations for the image patches, where $h_{i}^{L}$ is the vector of the $i$-th image patch.
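
A minimal sketch of such a backbone using PyTorch's built-in Transformer encoder is shown below; the class name and the toy dimensions are assumptions, not the BEiT-Base configuration:

```python
import torch
import torch.nn as nn

class TinyViTBackbone(nn.Module):
    """Minimal ViT-style encoder: patch embeddings + [S] token + 1D positions,
    followed by a stack of standard Transformer encoder layers.
    (Toy dimensions for illustration only.)"""
    def __init__(self, num_patches: int = 196, patch_dim: int = 768,
                 dim: int = 256, depth: int = 4, heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                   # patch embedding
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))         # special [S] token
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # 1D positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        B = patches.size(0)
        x = self.proj(patches)                                  # (B, N, dim)
        x = torch.cat([self.cls.expand(B, -1, -1), x], dim=1)   # prepend [S]
        x = x + self.pos
        h = self.blocks(x)                                      # H^L = [h_[S], h_1, ..., h_N]
        return h[:, 1:]                                         # per-patch vectors h_i^L

backbone = TinyViTBackbone()
h = backbone(torch.randn(2, 196, 768))
print(h.shape)  # torch.Size([2, 196, 256])
```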

NorbertZheng commented 7 months ago

BEiT Pretraining: Masked Image Modeling (MIM)

Masked Image Modeling (MIM)

BEiT Masked Image Modeling (MIM) (cut from the first figure).

The pre-training objective is to maximize the log-likelihood of the correct visual tokens $z_{i}$ given the corrupted image $x^{\mathcal{M}}$:

$$\max\sum_{x\in\mathcal{D}}\mathbb{E}_{\mathcal{M}}\left[\sum_{i\in\mathcal{M}}\log p_{\mathrm{MIM}}\left(z_{i}\mid x^{\mathcal{M}}\right)\right],$$

where $\mathcal{D}$ is the training corpus and $\mathcal{M}$ is the set of randomly masked positions.
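
In practice this amounts to a cross-entropy loss over the tokenizer vocabulary, evaluated only at masked positions. A minimal sketch, assuming hypothetical tensor shapes and a linear softmax head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mim_loss(h_masked: torch.Tensor, visual_tokens: torch.Tensor,
             mask: torch.Tensor, head: nn.Linear) -> torch.Tensor:
    """Cross-entropy over the tokenizer vocabulary at masked positions only.

    h_masked:      (B, N, dim) encoder outputs for the corrupted image x^M
    visual_tokens: (B, N) ground-truth token ids z_i from the tokenizer
    mask:          (B, N) boolean, True where the patch was masked (i in M)
    head:          linear classifier mapping dim -> |V|
    """
    logits = head(h_masked)                           # (B, N, |V|)
    return F.cross_entropy(logits[mask], visual_tokens[mask])

# Toy usage with made-up shapes (dim=256, |V|=8192).
head = nn.Linear(256, 8192)
h = torch.randn(2, 196, 256)
z = torch.randint(0, 8192, (2, 196))
mask = torch.rand(2, 196) < 0.4                       # roughly 40% of patches masked
loss = mim_loss(h, z, mask, head)
print(loss.item())
```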

NorbertZheng commented 7 months ago

Blockwise Masking

Blockwise Masking.

Blocks of patches are masked randomly as shown in the figure and algorithm above, instead of masking each patch individually in a random manner.
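
A rough sketch of such a blockwise masking procedure is given below; the block-size bound, aspect-ratio range, and mask ratio follow the spirit of the paper's algorithm but should be treated as illustrative values:

```python
import math
import random
import torch

def blockwise_mask(grid: int = 14, mask_ratio: float = 0.4,
                   min_block: int = 16, min_aspect: float = 0.3) -> torch.Tensor:
    """Sketch of blockwise masking on a grid x grid patch layout.

    Rectangular blocks of patches are masked repeatedly until roughly
    `mask_ratio` of all patches are covered, instead of masking patches i.i.d.
    """
    num_patches = grid * grid
    target = int(num_patches * mask_ratio)
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    while mask.sum() < target:
        # Sample a block area and an aspect ratio, then its top-left corner.
        s = random.randint(min_block, max(min_block, target - int(mask.sum())))
        r = math.exp(random.uniform(math.log(min_aspect), math.log(1 / min_aspect)))
        a = max(1, min(grid, int(round(math.sqrt(s * r)))))   # block height
        b = max(1, min(grid, int(round(math.sqrt(s / r)))))   # block width
        t = random.randint(0, grid - a)
        l = random.randint(0, grid - b)
        mask[t:t + a, l:l + b] = True
    return mask.flatten()                                      # (num_patches,) boolean

m = blockwise_mask()
print(m.sum().item(), "of", m.numel(), "patches masked")
```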

NorbertZheng commented 7 months ago

From VAE Perspective

The BEiT pre-training can be viewed as variational autoencoder training. Let $x$ denote the original image, $\tilde{x}$ the masked image, and $z$ the visual tokens; the evidence lower bound (ELBO) of the log-likelihood $p(x\mid\tilde{x})$ is:

$$\sum_{(x_{i},\tilde{x}_{i})\in\mathcal{D}}\log p(x_{i}\mid\tilde{x}_{i})\geq\sum_{(x_{i},\tilde{x}_{i})\in\mathcal{D}}\Big(\mathbb{E}_{z_{i}\sim q_{\phi}(z\mid x_{i})}\big[\log p_{\psi}(x_{i}\mid z_{i})\big]-D_{\mathrm{KL}}\big[q_{\phi}(z\mid x_{i}),\,p_{\theta}(z\mid\tilde{x}_{i})\big]\Big),$$

where $q_{\phi}(z\mid x)$ is the image tokenizer, $p_{\psi}(x\mid z)$ decodes the original image from visual tokens, and $p_{\theta}(z\mid\tilde{x})$ recovers the visual tokens from the masked image (the MIM task).

With a two-stage training procedure and the one-point estimate $\hat{z}_{i}=\arg\max_{z}q_{\phi}(z\mid x_{i})$, the above bound is re-written as:

$$\sum_{(x_{i},\tilde{x}_{i})\in\mathcal{D}}\Big(\underbrace{\mathbb{E}_{z_{i}\sim q_{\phi}(z\mid x_{i})}\big[\log p_{\psi}(x_{i}\mid z_{i})\big]}_{\text{Stage 1: visual token reconstruction}}+\underbrace{\log p_{\theta}(\hat{z}_{i}\mid\tilde{x}_{i})}_{\text{Stage 2: masked image modeling}}\Big),$$

where the second term is the proposed BEiT pre-training objective.

NorbertZheng commented 7 months ago

Experimental Results

ImageNet-1K & ImageNet-22K Pretraining, Image Classification on ImageNet-1K

Top-1 accuracy on ImageNet-1K using full fine-tuning.

BEiT improves the performance on ImageNet, which shows the effectiveness under the rich-resource setting.

More importantly, BEiT-384 pre-trained on ImageNet-1K even outperforms the supervised ViT-384 pre-trained on ImageNet-22K when they use the same input resolution.

Convergence curves of training DeiT from scratch and fine-tuning BEiT on ImageNet-1K.

Fine-tuning BEiT not only achieves better performance, but also converges much faster than training DeiT from scratch.

NorbertZheng commented 7 months ago

Semantic Segmentation on ADE20K

Results of semantic segmentation on ADE20K.

BEiT achieves better performance than supervised pretraining, although BEiT does not require manual annotations for pre-training.

Intermediate fine-tuning further improves BEiT on semantic segmentation.

NorbertZheng commented 7 months ago

Ablation Study

Ablation studies for BEiT pre-training on image classification and semantic segmentation.

NorbertZheng commented 7 months ago

Analysis on Self-Attention Map

Self-attention map for different reference points.

After pre-training, BEiT learns to distinguish semantic regions using self-attention heads, without any task-specific supervision. Such knowledge acquired by BEiT potentially improves the generalization ability of fine-tuned models, especially on small-scale datasets.
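
For intuition, the sketch below shows one way such a map could be computed from a layer's query/key projections; the tensors and shapes here are toy stand-ins rather than outputs of an actual pre-trained BEiT:

```python
import torch
import torch.nn.functional as F

def attention_map_for_reference(q: torch.Tensor, k: torch.Tensor,
                                ref_idx: int, grid: int = 14) -> torch.Tensor:
    """Toy illustration: given query/key projections of one layer
    (B, heads, N, head_dim), return the attention of one reference patch
    over all patches, averaged across heads and reshaped to the patch grid.
    """
    attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)  # (B, H, N, N)
    ref_attn = attn[:, :, ref_idx]                       # attention row of the reference patch
    return ref_attn.mean(dim=1).reshape(-1, grid, grid)  # (B, grid, grid)

# Toy tensors standing in for a real model's projections.
q = torch.randn(1, 4, 196, 64)
k = torch.randn(1, 4, 196, 64)
print(attention_map_for_reference(q, k, ref_idx=100).shape)  # torch.Size([1, 14, 14])
```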

NorbertZheng commented 7 months ago

Further Results Using LayerScale in CaiT and Relative Position in Shaw NAACL’18 (Paper Appendix)

Effects of LayerScale in CaiT & Relative Position in Shaw NAACL’18

Ablation studies of architecture variants on image classification and semantic segmentation.

Both LayerScale from CaiT and the relative position bias from Shaw NAACL’18 improve performance on ImageNet classification and ADE20K semantic segmentation.

NorbertZheng commented 7 months ago

ImageNet

Top-1 accuracy on ImageNet-1K fine-tuning.

BEiT-L fine-tuned on ImageNet-22K achieves comparable performance with ViT-L trained on Google JFT-3B.

NorbertZheng commented 7 months ago

ADE20K

Performance comparison on ADE20K semantic segmentation.

The BEiT-L model obtains state-of-the-art performance on ADE20K, outperforming Swin Transformer.

NorbertZheng commented 7 months ago

DINO applies self-supervised learning to ViT using an idea similar to BYOL, while BEiT goes further and brings the BERT pre-training concept to self-supervised learning on ViT.

NorbertZheng commented 7 months ago

Reference