A Transformer-based model that autoregressively models text and image tokens as a single stream of data.
Method
Problem: using pixels directly as image tokens -> inordinate amount of memory for high-resolution images / likelihood objectives tend to prioritize modeling short-range dependencies between pixels, spending capacity on high-frequency detail instead of the low-frequency structure that makes objects recognizable
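The paper's fix is a two-stage setup: a dVAE first compresses each 256x256 image into a 32x32 grid of discrete tokens (8192-way codebook), and these are concatenated after the BPE text tokens so the transformer sees one stream. A minimal sketch of that stream layout, with placeholder token values (the random ids here are made up for illustration, not real encoder output):

```python
import numpy as np

# Sketch of DALL-E's single-stream input: up to 256 BPE text tokens
# followed by 1024 image tokens (a 32x32 grid of dVAE codes),
# all modeled autoregressively as one sequence.
TEXT_LEN, IMAGE_GRID = 256, 32          # 256 text slots, 32x32 image grid
TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # BPE vocab size, dVAE codebook size

rng = np.random.default_rng(0)
text_tokens = rng.integers(0, TEXT_VOCAB, size=TEXT_LEN)           # placeholder BPE ids
image_tokens = rng.integers(0, IMAGE_VOCAB, size=IMAGE_GRID ** 2)  # placeholder dVAE codes

# Offsetting image ids past the text vocabulary lets both modalities
# share one token space for the transformer.
stream = np.concatenate([text_tokens, image_tokens + TEXT_VOCAB])

print(stream.shape)  # (1280,) = 256 text positions + 1024 image positions
```

The 32x32 grid is what makes the sequence tractable: 1024 image positions instead of 65536 raw pixels, an 8x reduction per spatial axis.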
DALL-E: Zero-Shot Text-to-Image Generation
I also wanted to read this alongside VideoBERT below, but more than anything I started with this one because it is the first multimodal paper on my list and I wanted to read something impressive.