NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review -- BEiT: BERT Pre-Training of Image Transformers. #79

Open NorbertZheng opened 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review — BEiT: BERT Pre-Training of Image Transformers.

NorbertZheng commented 1 year ago

Overview

BEiT: pretraining ViT using Masked Image Modeling (MIM).

BEiT: BERT Pre-Training of Image Transformers. BEiT, by Microsoft Research. 2022 ICLR, Over 300 Citations.

Self-Supervised Learning, BERT, Transformer, Vision Transformer, ViT, DALL·E.

NorbertZheng commented 1 year ago

BEiT Architecture

Figure: Overview of BEiT pre-training.

Overall Approach

During pre-training, some proportion of the image patches is randomly masked, and the corrupted input is fed to the Transformer.

NorbertZheng commented 1 year ago

Image Representation

During pre-training, each image has two views of representation, namely image patches and visual tokens.

Image Patches

Figure: Image Patches (cut from the first figure).

Particularly, BEiT splits each $224\times 224$ image into a $14\times 14$ grid of image patches, where each patch is $16\times 16$.
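As a concrete illustration, here is a minimal PyTorch sketch (not the paper's code) of this patch splitting; the function name `patchify` is made up here:

```python
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split (B, C, H, W) images into flattened patches of shape (B, N, P*P*C)."""
    B, C, H, W = images.shape
    assert H % patch_size == 0 and W % patch_size == 0
    # Unfold height and width into non-overlapping P x P windows:
    # (B, C, H, W) -> (B, C, H/P, W/P, P, P)
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # (B, C, H/P, W/P, P, P) -> (B, H/P, W/P, C, P, P) -> (B, N, P*P*C)
    patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()
    return patches.view(B, -1, C * patch_size * patch_size)

x = torch.randn(1, 3, 224, 224)
print(patchify(x).shape)  # torch.Size([1, 196, 768]): a 14x14 grid of 16x16x3 patches
```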

Visual Tokens

Figure: Visual Tokens (cut from the first figure).

Specifically, the image of size $H\times W\times C$ is tokenized into $z=[z_{1},\dots,z_{N}]$, where the vocabulary $\mathcal{V}=\{1,\dots,|\mathcal{V}|\}$ contains discrete token indices.
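A hedged sketch of this tokenization step: `dvae_encoder` below stands in for the pre-trained discrete VAE image tokenizer from DALL·E that BEiT reuses, and its exact API and output layout are assumptions here. Taking the argmax over the vocabulary dimension yields $N=14\times 14=196$ discrete token indices.

```python
import torch

def tokenize(images: torch.Tensor, dvae_encoder) -> torch.Tensor:
    """Map (B, C, H, W) images to (B, N) visual-token indices in {1, ..., |V|}."""
    logits = dvae_encoder(images)    # assumed output shape: (B, |V|, 14, 14), |V| = 8192
    tokens = logits.argmax(dim=1)    # (B, 14, 14) grid of discrete token indices
    return tokens.flatten(1)         # (B, 196), i.e. z_1, ..., z_N
```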

NorbertZheng commented 1 year ago

ViT Backbone
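BEiT uses a standard ViT encoder as its backbone. Below is a simplified sketch with BEiT-Base-like hyperparameters (12 layers, hidden size 768, 12 heads); the class name and the use of learnable absolute position embeddings are simplifying assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ViTBackbone(nn.Module):
    """Simplified ViT encoder: patch projection + [CLS] + positions + Transformer."""
    def __init__(self, num_patches=196, patch_dim=768, dim=768, depth=12, heads=12):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                   # patch embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # prepended [CLS] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patches):                                 # (B, N, patch_dim)
        x = self.proj(patches)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        return self.encoder(x)                                  # (B, N + 1, dim)
```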

NorbertZheng commented 1 year ago

BEiT Pretraining: Masked Image Modeling (MIM)

Masked Image Modeling (MIM)

Figure: BEiT Masked Image Modeling (MIM) (cut from the first figure).

The pre-training objective is to maximize the log-likelihood of the correct visual tokens $z_{i}$ given the corrupted image $x^{\mathcal{M}}$:

$$\max \sum_{x\in\mathcal{D}} \mathbb{E}_{\mathcal{M}}\left[\sum_{i\in\mathcal{M}} \log p_{\mathrm{MIM}}\left(z_{i}\mid x^{\mathcal{M}}\right)\right],$$

where $\mathcal{D}$ is the training corpus and $\mathcal{M}$ denotes the set of masked positions.
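In code, this objective amounts to a cross-entropy loss over the masked positions only. Below is a minimal sketch under assumed names (`backbone`, `mim_head`, and `mask_embed` are placeholders; BEiT replaces masked patches with a learnable [MASK] embedding at the patch-embedding stage, which is simplified here to replacing the raw patch vectors):

```python
import torch
import torch.nn.functional as F

def mim_loss(backbone, mim_head, patches, visual_tokens, mask, mask_embed):
    """
    patches:       (B, N, patch_dim) flattened image patches (inputs)
    visual_tokens: (B, N) target token indices from the dVAE tokenizer
    mask:          (B, N) boolean, True where the patch is masked
    mask_embed:    (patch_dim,) learnable embedding that replaces masked patches
    """
    corrupted = torch.where(mask.unsqueeze(-1), mask_embed, patches)
    hidden = backbone(corrupted)[:, 1:]       # drop [CLS] -> (B, N, dim)
    logits = mim_head(hidden)                 # (B, N, |V|) logits over visual tokens
    # Only masked positions contribute to the loss, matching the objective above.
    return F.cross_entropy(logits[mask], visual_tokens[mask])
```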

NorbertZheng commented 1 year ago

Blockwise Masking

Figure: Blockwise Masking.
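A hedged Python sketch in the spirit of the paper's blockwise masking procedure: rectangular blocks of patches (at least 16 patches each, aspect ratio within [0.3, 1/0.3]) are masked repeatedly until roughly 40% of the 14×14 grid is covered. The exact sampling details differ from the paper's Algorithm 1.

```python
import math
import random

def blockwise_mask(grid=14, mask_ratio=0.4, min_block=16):
    """Return a grid x grid boolean mask built from random rectangular blocks."""
    mask = [[False] * grid for _ in range(grid)]
    target = int(mask_ratio * grid * grid)
    masked = 0
    while masked < target:
        s = random.randint(min_block, max(min_block, target - masked))  # block area
        r = random.uniform(0.3, 1 / 0.3)                                # aspect ratio
        a = max(1, min(grid, int(round(math.sqrt(s * r)))))             # block height
        b = max(1, min(grid, int(round(math.sqrt(s / r)))))             # block width
        t = random.randint(0, grid - a)                                 # top-left row
        l = random.randint(0, grid - b)                                 # top-left column
        for i in range(t, t + a):
            for j in range(l, l + b):
                if not mask[i][j]:
                    mask[i][j] = True
                    masked += 1
    return mask
```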

NorbertZheng commented 1 year ago

From VAE Perspective