An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Paper
I looked into this paper after the ViT model came up during the paper discussion at ArxivTalk on 210502.
Introduction
CNN-like architectures with self-attention vs. applying a standard Transformer directly to images
Transformer: a self-attention-based architecture; pre-trained on a large text corpus and then fine-tuned on a smaller task-specific dataset
Self-attention
CNN with Self-attention
Method
Split an image into fixed-size patches & provide the sequence of linear embeddings of these patches as input to a Transformer -> patches = tokens (words) in an NLP application
Train the model on image classification in a supervised fashion
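The patch-to-token step above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function name, shapes, and the random projection matrix (standing in for the learned linear embedding) are all assumptions for demonstration.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, embed_dim=768, rng=None):
    """Split an (H, W, C) image into P x P patches, flatten each patch,
    and linearly project it into the embedding dimension.
    Returns an array of shape (num_patches, embed_dim)."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image dims must be divisible by patch size"
    # (H//P, P, W//P, P, C) -> (H//P, W//P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)
    # Stand-in for the learned linear projection in ViT
    E = rng.standard_normal((P * P * C, embed_dim))
    return patches @ E  # sequence of patch embeddings = "words" for the Transformer

tokens = image_to_patch_embeddings(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): a 224x224 image yields 14 x 14 = 196 tokens
```

With 16x16 patches, a 224x224 image becomes a sequence of 196 tokens, which is why the title reads "an image is worth 16x16 words."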