FoundationVision / VAR

[GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
MIT License

About the input resolution scale #8

Closed. luohao123 closed this issue 2 months ago.

luohao123 commented 3 months ago

From the method described in the paper, if the output resolution is large, such as 1024x2048, the actual generation time would be much longer than that of a diffusion model.

So for large-image generation, what is the actual strength of this method?

iFighting commented 3 months ago

> From the method described in the paper, if the output resolution is large, such as 1024x2048, the actual generation time would be much longer than that of a diffusion model.
>
> So for large-image generation, what is the actual strength of this method?

@luohao123

It's not actually the case, for a couple of reasons:

  1. After going through a VAE, a diffusion transformer produces a very large number of tokens. For instance, a 1024x2048 image with 16x downsampling yields a sequence of 8192 tokens. Since the token sequence is quite long and diffusion requires many denoising steps (typically 20 to 30), the process takes a significant amount of time in practice.
  2. Our tokenizer is a multi-scale VQ-VAE, which generates an image in a small number of coarse-to-fine steps, where each step predicts all tokens of that scale in parallel; generation is therefore much faster than diffusion (see the rough step-count sketch below).
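To make the step-count argument concrete, here is a minimal back-of-the-envelope sketch. It is not from the VAR codebase; the 25-step diffusion count and the scale schedule are illustrative assumptions, chosen only to compare how many sequential forward passes each paradigm needs.

```python
# Rough sequential-step comparison for a 1024x2048 image with a
# 16x-downsampling tokenizer. All numbers below are illustrative.

H, W, downsample = 1024, 2048, 16
tokens = (H // downsample) * (W // downsample)  # 64 * 128 = 8192 latent tokens

# Vanilla next-token AR: one forward pass per token, fully sequential.
ar_steps = tokens

# Diffusion transformer: each step denoises all 8192 tokens at once,
# but typically needs 20-30 denoising iterations (25 assumed here).
diffusion_steps = 25

# VAR next-scale prediction: one forward pass per scale of a coarse-to-fine
# pyramid; all tokens within a scale are predicted in parallel.
# Hypothetical schedule of (height, width) token maps:
scales = [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128)]
var_steps = len(scales)

print(f"latent tokens : {tokens}")
print(f"next-token AR : {ar_steps} sequential passes")
print(f"diffusion     : {diffusion_steps} sequential passes")
print(f"VAR next-scale: {var_steps} sequential passes")
```

The point is that VAR's sequential depth grows with the number of scales rather than the number of tokens, which is why a long token sequence at high resolution does not translate into proportionally longer generation time.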
luohao123 commented 3 months ago

Thanks for the reply. None of the demo images reaches a high resolution such as 1024x1024, and in quality there is actually still a gap compared with SOTA diffusion models.

What could be the reason for this?

iFighting commented 3 months ago

> Thanks for the reply. None of the demo images reaches a high resolution such as 1024x1024, and in quality there is actually still a gap compared with SOTA diffusion models.
>
> What could be the reason for this?

@luohao123

This is a class-conditional image generation framework. We conducted experiments on ImageNet at resolutions of 256x256 and 512x512; comparing these class-conditional results at lower resolutions with text-to-image diffusion models at 1024x1024 is clearly unfair. Also, are you referring to text-to-image generation? We will release work in that area soon.

luohao123 commented 3 months ago

Yes, I mean text-to-image; the current work looks like class-conditional image generation.

Is text-to-image with next-scale prediction comparable with diffusion models such as Stable Diffusion on large images?

iFighting commented 3 months ago

> Yes, I mean text-to-image; the current work looks like class-conditional image generation.
>
> Is text-to-image with next-scale prediction comparable with diffusion models such as Stable Diffusion on large images?

@luohao123

Under the same data quality and number of training epochs, VAR can match or even exceed the performance of diffusion models.

Additionally, it is worth noting that our method has better scalability: as the model scales up to 10B or 20B parameters, its potential will be even greater.

luohao123 commented 2 months ago

Really looking forward to your text-to-image work coming out!