BEiT, Pre-Training ViT Using Masked Image Modeling (MIM).
BEiT: BERT Pre-Training of Image Transformers. BEiT, by Microsoft Research. 2022 ICLR, Over 300 Citations.
Self-Supervised Learning, BERT, Transformer, Vision Transformer, ViT, DALL·E.
Overview of BEiT pre-training.
During pre-training, some proportion of image patches are randomly masked, and the corrupted input is fed to a backbone Transformer.
During pre-training, each image has two views of representations, namely:
Image Patches (Cut from the first figure).
Particularly, BEiT splits each $224\times 224$ image into a $14\times 14$ grid of image patches, where each patch is $16\times 16$ pixels, giving $N=196$ patches per image.
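To make the patchification concrete, here is a minimal PyTorch sketch (my own, not the paper's code; the name `patchify` is hypothetical) that splits a batch of $224\times 224$ images into 196 flattened $16\times 16$ patches:

```python
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """images: (B, C, H, W) -> patches: (B, N, patch_size*patch_size*C)."""
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # (B, C, H/P, P, W/P, P) -> (B, H/P, W/P, C, P, P) -> (B, N, P*P*C)
    x = images.reshape(b, c, h // patch_size, patch_size, w // patch_size, patch_size)
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(b, -1, c * patch_size * patch_size)
    return x

imgs = torch.randn(2, 3, 224, 224)
patches = patchify(imgs)
print(patches.shape)  # torch.Size([2, 196, 768]) -- a 14x14 grid of patches
```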
Visual Tokens (Cut from the first figure).
Specifically, an image of size $H\times W\times C$ is tokenized into $z=[z_{1},\dots,z_{N}]$, where the vocabulary $V=\{1,\dots,|V|\}$ contains discrete token indices. In BEiT, the tokenizer is the publicly available discrete VAE from DALL·E, with $|V|=8192$.
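The sketch below only illustrates the shape bookkeeping of tokenization; `toy_tokenizer` is a hypothetical stand-in that emits random indices, not the actual dVAE from DALL·E:

```python
import torch

VOCAB_SIZE = 8192  # |V| of DALL-E's discrete VAE, as used by BEiT

def toy_tokenizer(images: torch.Tensor, grid: int = 14) -> torch.Tensor:
    """images: (B, C, H, W) -> tokens: (B, grid*grid), values in {0, ..., |V|-1}."""
    b = images.shape[0]
    # Placeholder: a real tokenizer would encode pixels; here we just sample indices.
    return torch.randint(0, VOCAB_SIZE, (b, grid * grid))

tokens = toy_tokenizer(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196]) -- one visual token per patch position
```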
BEiT Masked Image Modeling (MIM) (Cut from the first figure).
The pre-training objective is to maximize the log-likelihood of the correct visual tokens $z_{i}$ given the corrupted image $x^{\mathcal{M}}$:

$$\max \sum_{x\in\mathcal{D}} \mathbb{E}_{\mathcal{M}}\left[\sum_{i\in\mathcal{M}} \log p_{\mathrm{MIM}}\left(z_{i} \mid x^{\mathcal{M}}\right)\right]$$

where $\mathcal{M}$ denotes the set of randomly masked positions and $\mathcal{D}$ is the training corpus.
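In practice this objective reduces to a softmax classification over the visual-token vocabulary at the masked positions. A minimal PyTorch sketch of the loss (my own simplification, not the official implementation):

```python
import torch
import torch.nn.functional as F

def mim_loss(logits: torch.Tensor, visual_tokens: torch.Tensor,
             mask: torch.Tensor) -> torch.Tensor:
    """
    logits:        (B, N, |V|) predictions from the Transformer head
    visual_tokens: (B, N) ground-truth token indices from the tokenizer
    mask:          (B, N) bool, True where the patch was masked
    """
    # Only masked positions contribute; cross-entropy = negative log-likelihood.
    return F.cross_entropy(logits[mask], visual_tokens[mask])

B, N, V = 2, 196, 8192
logits = torch.randn(B, N, V)
tokens = torch.randint(0, V, (B, N))
mask = torch.rand(B, N) < 0.4  # ~40% of patches masked, as in BEiT
loss = mim_loss(logits, tokens, mask)
```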
Blockwise Masking.
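Rather than masking each patch independently at random, BEiT masks contiguous blocks: each step samples a rectangular block of at least 16 patches with a randomly chosen aspect ratio, and blocks are drawn repeatedly until roughly 40% of the patches are masked. A simplified sketch of this procedure (the exact sampling bounds are assumptions here, not the paper's values):

```python
import math
import random

def blockwise_mask(grid: int = 14, mask_ratio: float = 0.4, min_block: int = 16):
    """Return a grid x grid boolean mask with roughly mask_ratio of patches masked."""
    mask = [[False] * grid for _ in range(grid)]
    target = int(mask_ratio * grid * grid)  # ~78 of 196 patches with the defaults
    masked = 0
    while masked < target:
        # Sample a block area (in patches) and an aspect ratio (assumed bounds).
        area = random.randint(min_block, 4 * min_block)
        aspect = math.exp(random.uniform(math.log(0.3), math.log(1 / 0.3)))
        h = max(1, min(grid, int(round(math.sqrt(area * aspect)))))
        w = max(1, min(grid, int(round(math.sqrt(area / aspect)))))
        top = random.randint(0, grid - h)
        left = random.randint(0, grid - w)
        for i in range(top, top + h):
            for j in range(left, left + w):
                if not mask[i][j]:
                    mask[i][j] = True
                    masked += 1
    return mask

m = blockwise_mask()
print(sum(map(sum, m)))  # number of masked patches, at least the ~40% target
```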
Sik-Ho Tsang. Review — BEiT: BERT Pre-Training of Image Transformers.