FoundationVision / VAR

[GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!

Scalability to multimodal large language models? #51

Open DEBIHOOD opened 1 month ago

DEBIHOOD commented 1 month ago

Hi, I was looking around to see what's new in the image generation field and spotted your paper as quite an interesting one! The idea of predicting the next resolution instead of the next flattened token feels quite interesting (it reminds me somewhat of how Progressive Growing GAN and its successor StyleGAN handled things), and predicting all the tokens of a scale at once even more so. I'm not entirely comfortable with methods built on this low-to-high pyramid structure (Progressive Growing GAN, StyleGAN, StyleGAN 2, and StyleGAN 3 have shown some of the issues it can create); a U-Net feels a bit more intuitive on that part of the spectrum, since it operates on the whole image while still doing these decompositions to lower resolutions internally. But hey, if it works, and works better than everything that existed before, I like it!

So my question is this: we have seen how well transformers scale, we have seen that one big transformer can work better than many small transformers (as in language translation), and we have even seen papers that combine it all into one big multimodal transformer (text tokens, image tokens, audio tokens), or that discard tokens altogether and work in the space of bytes (the MambaByte paper, not quite a transformer, but why not). How can we apply VAR to a big multimodal LLM, or a bimodal one for the sake of simplicity (since audio tokens are already flat)? With prior methods we can just flatten the image tokens, add the autoencoder's tokens to the vocabulary of the LLM, and wrap them in special ⟨img⟩ ⟨/img⟩ tokens once we are working with images.
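Just to make that last part concrete, here is a rough sketch of what I mean by flattening the image tokens and adding them to the LLM vocabulary. Everything in it is hypothetical (the vocabulary sizes, the `interleave` helper, the special token ids); it is not code from this repo, only an illustration of the kind of bimodal sequence I have in mind:

```python
# Hypothetical sketch: interleave VQ image tokens into a text LLM vocabulary.
# None of these names or sizes come from the VAR codebase.
import torch

TEXT_VOCAB_SIZE = 32000          # size of the base text tokenizer (assumed)
VQ_CODEBOOK_SIZE = 4096          # size of the image VQ codebook (assumed)
IMG_START = TEXT_VOCAB_SIZE + VQ_CODEBOOK_SIZE      # id for <img>
IMG_END   = TEXT_VOCAB_SIZE + VQ_CODEBOOK_SIZE + 1  # id for </img>

def interleave(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Build one flat sequence: text tokens, then <img>, image codes, </img>."""
    # Shift image codes into their own id range so they never collide with text ids.
    img_ids = image_codes.flatten() + TEXT_VOCAB_SIZE
    return torch.cat([
        text_ids,
        torch.tensor([IMG_START]),
        img_ids,
        torch.tensor([IMG_END]),
    ])

# Example: a 12-token caption followed by a 16x16 grid of VQ indices.
text_ids = torch.randint(0, TEXT_VOCAB_SIZE, (12,))
image_codes = torch.randint(0, VQ_CODEBOOK_SIZE, (16, 16))
seq = interleave(text_ids, image_codes)
print(seq.shape)  # torch.Size([270]) = 12 + 1 + 256 + 1
```

With VAR's next-scale prediction it is less obvious to me how the per-scale token maps would slot into such a flat sequence, which is what prompted the question.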

By the way, you also pointed out in the paper that rejection sampling achieved better scores, but what exactly is rejection sampling here? Is it just applying top-k and CFG, as opposed to sampling from the original probabilities?
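For reference, this is the kind of sampling I mean by "top-k and CFG", written as a minimal hypothetical sketch (dummy logits and a made-up function name, not the actual VAR sampling code):

```python
# Hypothetical sketch of top-k filtering plus classifier-free guidance on logits.
import torch
import torch.nn.functional as F

def sample_with_topk_cfg(cond_logits, uncond_logits, cfg_scale=1.5, top_k=600):
    # Classifier-free guidance: push conditional logits away from unconditional ones.
    logits = uncond_logits + cfg_scale * (cond_logits - uncond_logits)
    # Top-k filtering: keep only the k most likely tokens, mask out the rest.
    kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Sample from the renormalized distribution.
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Example usage with dummy logits over a 4096-entry codebook.
cond = torch.randn(2, 4096)
uncond = torch.randn(2, 4096)
tokens = sample_with_topk_cfg(cond, uncond)
print(tokens.shape)  # torch.Size([2, 1])
```

Is that roughly what "rejection sampling" refers to in the paper, or is it a separate step on top of this?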

Thanks for your work, looking forward to your text2image sequel.