BEiT: BERT Pre-Training of Image Transformers, BEiT, by Microsoft Research, 2022 ICLR, Over 300 Citations.
During pre-training, some proportion of image patches are randomly masked, and the corrupted input is fed to the backbone Transformer. The model learns to recover the visual tokens of the original image, rather than the raw pixels of the masked patches.
Each image has two views of representations, namely image patches and visual tokens. The two types serve as input and output representations during pre-training, respectively.
Image Patches (Cut from the first figure).
Particularly, BEiT splits each $224\times 224$ image into a $14\times 14$ grid of image patches, where each patch is $16\times 16$.
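As a concrete illustration, the patch splitting amounts to extracting non-overlapping $16\times 16$ blocks and flattening each one. A minimal PyTorch sketch (not the official BEiT code):

```python
# Minimal sketch: split a 224x224 image into a 14x14 grid of 16x16 patches.
import torch

image = torch.randn(1, 3, 224, 224)   # (B, C, H, W)
patch_size = 16

# unfold extracts non-overlapping 16x16 blocks: (B, C*16*16, 14*14)
patches = torch.nn.functional.unfold(image, kernel_size=patch_size, stride=patch_size)
patches = patches.transpose(1, 2)      # (B, 196, 768): N = 14*14 patches, each 3*16*16 values

print(patches.shape)                   # torch.Size([1, 196, 768])
```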
Visual Tokens (Cut from the first figure).
Specifically, an image of size $H\times W\times C$ is tokenized into $z=[z_{1},\dots,z_{N}]$, where the vocabulary $V=\{1,\dots,|V|\}$ contains discrete token indices.
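BEiT reuses the publicly released DALL-E dVAE (vocabulary size $|V|=8192$) as its tokenizer. The toy module below is only a shape-level stand-in, assumed for illustration of what "tokenizing into discrete indices" means; it is not the actual dVAE:

```python
# Hedged sketch: a stand-in tokenizer mapping an image to a 14x14 grid of
# discrete token indices. BEiT itself reuses the DALL-E dVAE; this toy
# encoder only illustrates the shapes involved.
import torch
import torch.nn as nn

class ToyTokenizer(nn.Module):
    def __init__(self, vocab_size=8192):
        super().__init__()
        # stride-16 conv downsamples 224x224 -> 14x14, one logit per vocabulary entry
        self.to_logits = nn.Conv2d(3, vocab_size, kernel_size=16, stride=16)

    def forward(self, image):              # image: (B, 3, 224, 224)
        logits = self.to_logits(image)     # (B, |V|, 14, 14)
        tokens = logits.argmax(dim=1)      # (B, 14, 14) indices in {0, ..., |V|-1}
        return tokens.flatten(1)           # (B, N=196) -> z = [z_1, ..., z_N]

tokens = ToyTokenizer()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                        # torch.Size([1, 196])
```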
The final hidden vectors $\{h_{i}^{L}\}_{i=1}^{N}$ of the backbone Transformer are used as the encoded representations for the image patches, where $h_{i}^{L}$ is the vector of the $i$-th image patch.
BEiT Masked Image Modeling (MIM) (Cut from the first figure).
The pre-training objective is to maximize the log-likelihood of the correct visual tokens $z_{i}$ given the corrupted image:
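The equation itself belongs to the cut figure; as recalled from the paper (with $\mathcal{D}$ the training corpus, $\mathcal{M}$ the set of masked positions, and $x^{\mathcal{M}}$ the corrupted image), it reads roughly:

$$\max \sum_{x\in\mathcal{D}} \mathbb{E}_{\mathcal{M}}\left[\sum_{i\in\mathcal{M}} \log p_{\mathrm{MIM}}\!\left(z_{i}\mid x^{\mathcal{M}}\right)\right],$$

where $p_{\mathrm{MIM}}(\cdot\mid x^{\mathcal{M}})$ is a softmax over the visual-token vocabulary computed from the final hidden vector $h_{i}^{L}$.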
Blockwise Masking.
Blocks of patches are masked randomly, as shown in the figure and algorithm above, instead of masking each patch independently at random.
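A minimal sketch of the idea follows. The constants (roughly 40% of patches masked, minimum block of 16 patches, bounded aspect ratio) follow the paper, but the implementation itself is illustrative rather than the official algorithm:

```python
# Illustrative blockwise masking: keep sampling rectangular blocks of patches
# until roughly 40% of the 14x14 grid is masked.
import math
import random

def blockwise_mask(grid=14, mask_ratio=0.4, min_block=16, max_aspect=3.0):
    mask = [[False] * grid for _ in range(grid)]
    num_masked, target = 0, int(mask_ratio * grid * grid)
    while num_masked < target:
        # sample a block area and aspect ratio, then convert to height/width
        area = random.randint(min_block, max(min_block, target - num_masked))
        aspect = random.uniform(1.0 / max_aspect, max_aspect)
        h = max(1, min(grid, int(round(math.sqrt(area * aspect)))))
        w = max(1, min(grid, int(round(math.sqrt(area / aspect)))))
        top, left = random.randint(0, grid - h), random.randint(0, grid - w)
        for i in range(top, top + h):
            for j in range(left, left + w):
                if not mask[i][j]:
                    mask[i][j] = True
                    num_masked += 1
    return mask

mask = blockwise_mask()
print(sum(map(sum, mask)))   # roughly 0.4 * 196 ~= 78 masked patches
```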
The BEiT pre-training can be viewed as variational autoencoder training:
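The cut equation is essentially the evidence lower bound, with the tokenizer $q_{\phi}(z\mid x)$ as the encoder, the dVAE decoder $p_{\psi}(x\mid z)$ reconstructing the image, and $p_{\theta}(z\mid \tilde{x})$ recovering visual tokens from the masked image $\tilde{x}$ (notation recalled from the paper):

$$\sum_{(x_{i},\tilde{x}_{i})\in\mathcal{D}} \log p(x_{i}\mid\tilde{x}_{i}) \;\ge\; \sum_{(x_{i},\tilde{x}_{i})\in\mathcal{D}} \Big( \mathbb{E}_{z_{i}\sim q_{\phi}(z\mid x_{i})}\big[\log p_{\psi}(x_{i}\mid z_{i})\big] - D_{\mathrm{KL}}\big[q_{\phi}(z\mid x_{i}) \,\|\, p_{\theta}(z\mid\tilde{x}_{i})\big]\Big).$$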
Thus, the above equation is re-written as:
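With a one-step relaxation, i.e. $\hat{z}_{i}=\arg\max_{z} q_{\phi}(z\mid x_{i})$ (again as recalled from the paper's derivation):

$$\sum_{(x_{i},\tilde{x}_{i})\in\mathcal{D}} \Big( \mathbb{E}_{z_{i}\sim q_{\phi}(z\mid x_{i})}\big[\log p_{\psi}(x_{i}\mid z_{i})\big] + \log p_{\theta}(\hat{z}_{i}\mid\tilde{x}_{i})\Big),$$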
where the second term is the proposed BEiT pre-training objective.
Top-1 accuracy on ImageNet-1K using full fine-tuning.
BEiT improves performance on ImageNet, which shows its effectiveness under the rich-resource setting.
More importantly, BEiT-384 pre-trained on ImageNet-1K even outperforms the supervised pre-trained ViT-384 that uses ImageNet-22K, when they use the same input resolution.
Convergence curves of training DeiT from scratch and fine-tuning BEiT on ImageNet-1K.
Fine-tuning BEiT not only achieves better performance, but also converges much faster than training DeiT from scratch.
Results of semantic segmentation on ADE20K.
BEiT achieves better performance than supervised pre-training, even though it does not require manual annotations for pre-training.
Intermediate fine-tuning further improves BEiT on semantic segmentation.
Ablation studies for BEiT pre-training on image classification and semantic segmentation.
Self-attention map for different reference points.
After pre-training, BEiT learns to distinguish semantic regions using self-attention heads, without any task-specific supervision. Such knowledge acquired by BEiT potentially improves the generalization ability of fine-tuned models, especially on small-scale datasets.
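For readers who want to reproduce this kind of visualization, here is a minimal sketch of extracting the attention map for one reference patch. It is not BEiT's own plotting code, and the $[\mathrm{CLS}]$-token-at-index-0 layout is an assumption:

```python
# Hedged sketch: given attention weights from the last self-attention layer,
# take the row for a chosen reference patch and reshape it to the 14x14 grid.
import torch

def attention_map_for_reference(attn, ref_patch, grid=14):
    # attn: (B, heads, 1+N, 1+N) with an assumed [CLS] token at index 0
    attn = attn.mean(dim=1)              # average over heads -> (B, 1+N, 1+N)
    row = attn[:, 1 + ref_patch, 1:]     # attention from the reference patch to all patches
    return row.reshape(-1, grid, grid)   # (B, 14, 14) map to overlay on the image

dummy_attn = torch.softmax(torch.randn(1, 12, 197, 197), dim=-1)
print(attention_map_for_reference(dummy_attn, ref_patch=90).shape)   # torch.Size([1, 14, 14])
```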
Ablation studies of architecture variants on image classification and semantic segmentation.
LayerScale (from CaiT) and relative position bias (Shaw et al., NAACL'18) improve performance on ImageNet classification and ADE20K semantic segmentation.
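For reference, LayerScale is just a learnable per-channel scaling of each residual branch, initialized to a small value. A minimal sketch, not the CaiT or BEiT implementation:

```python
# Minimal LayerScale sketch: scale each residual branch output by a learnable
# per-channel vector initialized to a small value.
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    def __init__(self, dim, init_value=1e-4):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):          # x: (B, N, dim), output of an attention or MLP branch
        return self.gamma * x      # broadcast per-channel scaling

# usage inside a block: x = x + layer_scale(attention(norm(x)))
```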
Top-1 accuracy on ImageNet-1K fine-tuning.
BEiT-L fine-tuned on ImageNet-22K achieves performance comparable to ViT-L trained on Google JFT-3B.
Performance comparison on the ADE20K semantic segmentation.
The BEiT-L model obtains state-of-the-art performance on ADE20K, outperforming Swin Transformer.
Sik-Ho Tang. Review — BEiT: BERT Pre-Training of Image Transformers.