An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Paper
I looked into this paper after the ViT model came up during the paper discussion at ArxivTalk on 210502.
Introduction
CNN-like architectures with self-attention vs. applying a standard Transformer directly to images
Transformer: a self-attention-based architecture; pre-trained on a large text corpus and then fine-tuned on a smaller task-specific dataset
Self-attention
CNN with Self-attention
Method
Split an image into fixed-size patches & provide the sequence of linear embeddings of these patches as input to a Transformer -> patches = tokens (words) in an NLP application
Train the model on image classification in a supervised fashion
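The patch-to-token step above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function name, shapes, and the random projection matrix (standing in for the learned linear embedding) are all assumptions for demonstration.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, embed_dim=768, rng=None):
    """Split an (H, W, C) image into P x P patches, flatten each patch,
    and linearly project it into the embedding dimension.
    Returns an array of shape (num_patches, embed_dim)."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image dims must be divisible by patch size"
    # (H//P, P, W//P, P, C) -> (H//P, W//P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)
    # Stand-in for the learned linear projection in ViT
    E = rng.standard_normal((P * P * C, embed_dim))
    return patches @ E  # sequence of patch embeddings = "words" for the Transformer

tokens = image_to_patch_embeddings(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): a 224x224 image yields 14 x 14 = 196 tokens
```

With 16x16 patches, a 224x224 image becomes a sequence of 196 tokens, which is why the title reads "an image is worth 16x16 words."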