BEiT, Pre-Training ViT Using Masked Image Modeling (MIM).
BEiT: BERT Pre-Training of Image Transformers. BEiT, by Microsoft Research. 2022 ICLR, Over 300 Citations.
Self-Supervised Learning, BERT, Transformer, Vision Transformer, ViT, DALL·E.
Overview of BEiT pre-training.
During pre-training, some proportion of image patches are randomly masked, and the corrupted input is fed to a backbone Transformer.
During pre-training, each image has two views of representations, namely:
Image Patches (Cut from the first figure).
Particularly, BEiT splits each $224\times 224$ image into a $14\times 14$ grid of image patches, where each patch is $16\times 16$ pixels, giving $N=196$ patches per image.
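To make the patchification concrete, here is a minimal PyTorch sketch (my own, not the paper's code; the name `patchify` is hypothetical) that splits a batch of $224\times 224$ images into 196 flattened $16\times 16$ patches:

```python
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """images: (B, C, H, W) -> patches: (B, N, patch_size*patch_size*C)."""
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # (B, C, H/P, P, W/P, P) -> (B, H/P, W/P, C, P, P) -> (B, N, P*P*C)
    x = images.reshape(b, c, h // patch_size, patch_size, w // patch_size, patch_size)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(b, -1, c * patch_size * patch_size)
    return x

imgs = torch.randn(2, 3, 224, 224)
patches = patchify(imgs)
print(patches.shape)  # torch.Size([2, 196, 768]) -- a 14x14 grid of patches
```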
Visual Tokens (Cut from the first figure).
Specifically, an image of size $H\times W\times C$ is tokenized into $z=[z_{1},\dots,z_{N}]$, where the vocabulary $V=\{1,\dots,|V|\}$ contains discrete token indices. In BEiT, the tokenizer is the publicly available discrete VAE from DALL·E, with $|V|=8192$.
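The sketch below only illustrates the shape bookkeeping of tokenization; `toy_tokenizer` is a hypothetical stand-in that emits random indices, not the actual dVAE from DALL·E:

```python
import torch

VOCAB_SIZE = 8192  # |V| of DALL-E's discrete VAE, as used by BEiT

def toy_tokenizer(images: torch.Tensor, grid: int = 14) -> torch.Tensor:
    """images: (B, C, H, W) -> tokens: (B, grid*grid), values in {0, ..., |V|-1}."""
    b = images.shape[0]
    # Placeholder: a real tokenizer would encode pixels; here we just sample indices.
    return torch.randint(0, VOCAB_SIZE, (b, grid * grid))

tokens = toy_tokenizer(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196]) -- one visual token per patch position
```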
BEiT Masked Image Modeling (MIM) (Cut from the first figure).
The pre-training objective is to maximize the log-likelihood of the correct visual tokens $z_{i}$ given the corrupted image $x^{\mathcal{M}}$:

$$\max \sum_{x\in\mathcal{D}} \mathbb{E}_{\mathcal{M}}\left[\sum_{i\in\mathcal{M}} \log p_{\mathrm{MIM}}\left(z_{i} \mid x^{\mathcal{M}}\right)\right]$$

where $\mathcal{M}$ denotes the set of randomly masked positions and $\mathcal{D}$ is the training corpus.
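In practice this objective reduces to a softmax classification over the visual-token vocabulary at the masked positions. A minimal PyTorch sketch of the loss (my own simplification, not the official implementation):

```python
import torch
import torch.nn.functional as F

def mim_loss(logits: torch.Tensor, visual_tokens: torch.Tensor,
             mask: torch.Tensor) -> torch.Tensor:
    """
    logits:        (B, N, |V|) predictions from the Transformer head
    visual_tokens: (B, N) ground-truth token indices from the tokenizer
    mask:          (B, N) bool, True where the patch was masked
    """
    # Only masked positions contribute; cross-entropy = negative log-likelihood.
    return F.cross_entropy(logits[mask], visual_tokens[mask])

B, N, V = 2, 196, 8192
logits = torch.randn(B, N, V)
tokens = torch.randint(0, V, (B, N))
mask = torch.rand(B, N) < 0.4  # ~40% of patches masked, as in BEiT
loss = mim_loss(logits, tokens, mask)
```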
Blockwise Masking.
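Rather than masking each patch independently at random, BEiT masks contiguous blocks: each step samples a rectangular block of at least 16 patches with a randomly chosen aspect ratio, and blocks are drawn repeatedly until roughly 40% of the patches are masked. A simplified sketch of this procedure (the exact sampling bounds are assumptions here, not the paper's values):

```python
import math
import random

def blockwise_mask(grid: int = 14, mask_ratio: float = 0.4, min_block: int = 16):
    """Return a grid x grid boolean mask with roughly mask_ratio of patches masked."""
    mask = [[False] * grid for _ in range(grid)]
    target = int(mask_ratio * grid * grid)  # ~78 of 196 patches with the defaults
    masked = 0
    while masked < target:
        # Sample a block area (in patches) and an aspect ratio (assumed bounds).
        area = random.randint(min_block, 4 * min_block)
        aspect = math.exp(random.uniform(math.log(0.3), math.log(1 / 0.3)))
        h = max(1, min(grid, int(round(math.sqrt(area * aspect)))))
        w = max(1, min(grid, int(round(math.sqrt(area / aspect)))))
        top = random.randint(0, grid - h)
        left = random.randint(0, grid - w)
        for i in range(top, top + h):
            for j in range(left, left + w):
                if not mask[i][j]:
                    mask[i][j] = True
                    masked += 1
    return mask

m = blockwise_mask()
print(sum(map(sum, m)))  # number of masked patches, at least the ~40% target
```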
Sik-Ho Tsang. Review — BEiT: BERT Pre-Training of Image Transformers.