JisuHann / One-day-One-paper

Review paper

ViT - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ICLR 2021) #3

Closed by JisuHann 3 years ago

JisuHann commented 3 years ago

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

I looked into this paper because the ViT model came up in the paper discussion on ArxivTalk (210502).

Introduction

Prior approaches combine CNN-like architectures with self-attention; ViT instead applies a standard Transformer directly to images.

Method

  1. Split an image into fixed-size patches and feed the sequence of linear embeddings of these patches as input to a Transformer -> patches play the role of tokens (words) in an NLP application (see the sketch after this list).
  2. Train the model on image classification in a supervised fashion.
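
A minimal sketch of step 1, assuming PyTorch; the class name `PatchEmbedding` and the default hyperparameters (224x224 input, 16x16 patches, 768-dim embeddings, roughly ViT-Base/16) are illustrative choices, not the authors' reference code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly embed each patch."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch_size is equivalent to flattening
        # each non-overlapping patch and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable [class] token and position embeddings (one per patch + class token).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, embed_dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend the class token
        return x + self.pos_embed              # add position embeddings
```

The resulting (B, 197, 768) token sequence can then go through a standard Transformer encoder (e.g. `torch.nn.TransformerEncoder`), and the output at the class-token position is fed to a classification head for the supervised training in step 2.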
JisuHann commented 3 years ago

Summary complete.