BEiT: BERT Pre-Training of Image Transformers, BEiT, by Microsoft Research, 2022 ICLR, Over 300 Citations.
During pre-training, some proportion of image patches are randomly masked, and the corrupted input is fed to the backbone Transformer. The model learns to recover the visual tokens of the original image, rather than the raw pixels of the masked patches.
Each image has two views of representations, namely image patches and visual tokens. The two types serve as input and output representations during pre-training, respectively.
Image Patches (Cut from the first figure).
Particularly, BEiT splits each $224\times 224$ image into a $14\times 14$ grid of image patches, where each patch is $16\times 16$.
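As a concrete illustration, the patch splitting amounts to extracting non-overlapping $16\times 16$ blocks and flattening each one. A minimal PyTorch sketch (not the official BEiT code):

```python
# Minimal sketch: split a 224x224 image into a 14x14 grid of 16x16 patches.
import torch

image = torch.randn(1, 3, 224, 224)   # (B, C, H, W)
patch_size = 16

# unfold extracts non-overlapping 16x16 blocks: (B, C*16*16, 14*14)
patches = torch.nn.functional.unfold(image, kernel_size=patch_size, stride=patch_size)
patches = patches.transpose(1, 2)      # (B, 196, 768): N = 14*14 patches, each 3*16*16 values

print(patches.shape)                   # torch.Size([1, 196, 768])
```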
Visual Tokens (Cut from the first figure).
Specifically, an image of size $H\times W\times C$ is tokenized into $z=[z_{1},\dots,z_{N}]$, where the vocabulary $V=\{1,\dots,|V|\}$ contains discrete token indices.
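BEiT reuses the publicly released DALL-E dVAE (vocabulary size $|V|=8192$) as its tokenizer. The toy module below is only a shape-level stand-in, assumed for illustration of what "tokenizing into discrete indices" means; it is not the actual dVAE:

```python
# Hedged sketch: a stand-in tokenizer mapping an image to a 14x14 grid of
# discrete token indices. BEiT itself reuses the DALL-E dVAE; this toy
# encoder only illustrates the shapes involved.
import torch
import torch.nn as nn

class ToyTokenizer(nn.Module):
    def __init__(self, vocab_size=8192):
        super().__init__()
        # stride-16 conv downsamples 224x224 -> 14x14, one logit per vocabulary entry
        self.to_logits = nn.Conv2d(3, vocab_size, kernel_size=16, stride=16)

    def forward(self, image):              # image: (B, 3, 224, 224)
        logits = self.to_logits(image)     # (B, |V|, 14, 14)
        tokens = logits.argmax(dim=1)      # (B, 14, 14) indices in {0, ..., |V|-1}
        return tokens.flatten(1)           # (B, N=196) -> z = [z_1, ..., z_N]

tokens = ToyTokenizer()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                        # torch.Size([1, 196])
```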
The final hidden vectors $\{h_{i}^{L}\}_{i=1}^{N}$ of the backbone Transformer are used as the encoded representations for the image patches, where $h_{i}^{L}$ is the vector of the $i$-th image patch.
BEiT Masked Image Modeling (MIM) (Cut from the first figure).
The pre-training objective is to maximize the log-likelihood of the correct visual tokens $z_{i}$ given the corrupted image:
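The equation itself belongs to the cut figure; as recalled from the paper (with $\mathcal{D}$ the training corpus, $\mathcal{M}$ the set of masked positions, and $x^{\mathcal{M}}$ the corrupted image), it reads roughly:

$$\max \sum_{x\in\mathcal{D}} \mathbb{E}_{\mathcal{M}}\left[\sum_{i\in\mathcal{M}} \log p_{\mathrm{MIM}}\!\left(z_{i}\mid x^{\mathcal{M}}\right)\right],$$

where $p_{\mathrm{MIM}}(\cdot\mid x^{\mathcal{M}})$ is a softmax over the visual-token vocabulary computed from the final hidden vector $h_{i}^{L}$.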
Blockwise Masking.
Blocks of patches are masked randomly, as shown in the figure and algorithm above, instead of masking each patch independently at random.
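A minimal sketch of the idea follows. The constants (roughly 40% of patches masked, minimum block of 16 patches, bounded aspect ratio) follow the paper, but the implementation itself is illustrative rather than the official algorithm:

```python
# Illustrative blockwise masking: keep sampling rectangular blocks of patches
# until roughly 40% of the 14x14 grid is masked.
import math
import random

def blockwise_mask(grid=14, mask_ratio=0.4, min_block=16, max_aspect=3.0):
    mask = [[False] * grid for _ in range(grid)]
    num_masked, target = 0, int(mask_ratio * grid * grid)
    while num_masked < target:
        # sample a block area and aspect ratio, then convert to height/width
        area = random.randint(min_block, max(min_block, target - num_masked))
        aspect = random.uniform(1.0 / max_aspect, max_aspect)
        h = max(1, min(grid, int(round(math.sqrt(area * aspect)))))
        w = max(1, min(grid, int(round(math.sqrt(area / aspect)))))
        top, left = random.randint(0, grid - h), random.randint(0, grid - w)
        for i in range(top, top + h):
            for j in range(left, left + w):
                if not mask[i][j]:
                    mask[i][j] = True
                    num_masked += 1
    return mask

mask = blockwise_mask()
print(sum(map(sum, mask)))   # roughly 0.4 * 196 ~= 78 masked patches
```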
The BEiT pre-training can be viewed as variational autoencoder training:
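The cut equation is essentially the evidence lower bound, with the tokenizer $q_{\phi}(z\mid x)$ as the encoder, the dVAE decoder $p_{\psi}(x\mid z)$ reconstructing the image, and $p_{\theta}(z\mid \tilde{x})$ recovering visual tokens from the masked image $\tilde{x}$ (notation recalled from the paper):

$$\sum_{(x_{i},\tilde{x}_{i})\in\mathcal{D}} \log p(x_{i}\mid\tilde{x}_{i}) \;\ge\; \sum_{(x_{i},\tilde{x}_{i})\in\mathcal{D}} \Big( \mathbb{E}_{z_{i}\sim q_{\phi}(z\mid x_{i})}\big[\log p_{\psi}(x_{i}\mid z_{i})\big] - D_{\mathrm{KL}}\big[q_{\phi}(z\mid x_{i}) \,\|\, p_{\theta}(z\mid\tilde{x}_{i})\big]\Big).$$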
Thus, the above equation is re-written as:
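With a one-step relaxation, i.e. $\hat{z}_{i}=\arg\max_{z} q_{\phi}(z\mid x_{i})$ (again as recalled from the paper's derivation):

$$\sum_{(x_{i},\tilde{x}_{i})\in\mathcal{D}} \Big( \mathbb{E}_{z_{i}\sim q_{\phi}(z\mid x_{i})}\big[\log p_{\psi}(x_{i}\mid z_{i})\big] + \log p_{\theta}(\hat{z}_{i}\mid\tilde{x}_{i})\Big),$$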
where the second term is the proposed BEiT pre-training objective.
Top-1 accuracy on ImageNet-1K using full fine-tuning.
BEiT improves performance on ImageNet, which shows its effectiveness under the rich-resource setting.
More importantly, BEiT-384 pre-trained on ImageNet-1K even outperforms the supervised pre-trained ViT-384 that uses ImageNet-22K, when they use the same input resolution.
Convergence curves of training DeiT from scratch and fine-tuning BEiT on ImageNet-1K.
Fine-tuning BEiT not only achieves better performance, but also converges much faster than training DeiT from scratch.
Results of semantic segmentation on ADE20K.
BEiT achieves better performance than supervised pre-training, even though it does not require manual annotations for pre-training.
Intermediate fine-tuning further improves BEiT on semantic segmentation.
Ablation studies for BEiT pre-training on image classification and semantic segmentation.
Self-attention map for different reference points.
After pre-training, BEiT learns to distinguish semantic regions using self-attention heads, without any task-specific supervision. Such knowledge acquired by BEiT potentially improves the generalization ability of fine-tuned models, especially on small-scale datasets.
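For readers who want to reproduce this kind of visualization, here is a minimal sketch of extracting the attention map for one reference patch. It is not BEiT's own plotting code, and the $[\mathrm{CLS}]$-token-at-index-0 layout is an assumption:

```python
# Hedged sketch: given attention weights from the last self-attention layer,
# take the row for a chosen reference patch and reshape it to the 14x14 grid.
import torch

def attention_map_for_reference(attn, ref_patch, grid=14):
    # attn: (B, heads, 1+N, 1+N) with an assumed [CLS] token at index 0
    attn = attn.mean(dim=1)              # average over heads -> (B, 1+N, 1+N)
    row = attn[:, 1 + ref_patch, 1:]     # attention from the reference patch to all patches
    return row.reshape(-1, grid, grid)   # (B, 14, 14) map to overlay on the image

dummy_attn = torch.softmax(torch.randn(1, 12, 197, 197), dim=-1)
print(attention_map_for_reference(dummy_attn, ref_patch=90).shape)   # torch.Size([1, 14, 14])
```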
Ablation studies of architecture variants on image classification and semantic segmentation.
LayerScale (from CaiT) and relative position bias (Shaw et al., NAACL'18) improve performance on ImageNet classification and ADE20K semantic segmentation.
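For reference, LayerScale is just a learnable per-channel scaling of each residual branch, initialized to a small value. A minimal sketch, not the CaiT or BEiT implementation:

```python
# Minimal LayerScale sketch: scale each residual branch output by a learnable
# per-channel vector initialized to a small value.
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    def __init__(self, dim, init_value=1e-4):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):          # x: (B, N, dim), output of an attention or MLP branch
        return self.gamma * x      # broadcast per-channel scaling

# usage inside a block: x = x + layer_scale(attention(norm(x)))
```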
Top-1 accuracy on ImageNet-1K fine-tuning.
BEiT-L fine-tuned on ImageNet-22K achieves performance comparable to ViT-L trained on Google JFT-3B.
Performance comparison on the ADE20K semantic segmentation.
The BEiT-L model obtains state-of-the-art performance on ADE20K, outperforming Swin Transformer.
Sik-Ho Tang. Review — BEiT: BERT Pre-Training of Image Transformers.