FoundationVision / VAR

[GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
MIT License

can it do super resolution? #3

Open francqz31 opened 2 months ago

francqz31 commented 2 months ago

Can VAR do super resolution like GigaGAN's super resolution, for example? GigaGAN is the most impressive super-resolution algorithm so far. And if yes, would you be able to add support for it later, say next month or so?

keyu-tian commented 2 months ago

VAR supports zero-shot super resolution. Although it might not rival the GigaGAN upsampler, we're planning to release a demo for testing in the coming days. Stay tuned for updates!

judywxy1122 commented 2 months ago

Hi keyu,

I'm Bingyue's friend and I'm very impressed with this work!

I have a question regarding large size image with super high resolution.

First, let me try to understand the fundamental logic. Correct me if I'm wrong. 1) The basic idea is to establish a self-supervised learning mechanism. In VAR, we follow the process: raw img -> embedding f -> forward: (r_K -> ... -> r_1) -> backward: (r_1 -> ... -> r_K) -> recovered embedding f^ -> reconstructed img

i.e. from fine to coarse and then inversely from coarse to fine

2) The learning is based on a probabilistic generative model with the conditional generation probabilities p(r_k | r_{k-1}, ..., r_1) for k = 1, ..., K, with r_0 as a pre-defined start (i.e., the guidance).
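To make sure I'm reading point 2) correctly, here is a toy sketch of how I picture the coarse-to-fine sampling loop; `predict_scale`, `embed_and_upsample`, and `decode` are made-up placeholders standing in for the transformer and the multi-scale tokenizer, not the actual VAR API:

```python
import torch

def sample_coarse_to_fine(predict_scale, embed_and_upsample, decode, scales, r0):
    """Toy sketch of point 2): r_k ~ p(r_k | r_{k-1}, ..., r_1) for k = 1, ..., K, coarse to fine.

    `predict_scale`, `embed_and_upsample`, and `decode` are made-up placeholders
    standing in for the transformer and the multi-scale VQ tokenizer, not the actual VAR API.
    """
    context = [r0]                                        # r_0: the pre-defined start / guidance
    f_hat = None                                          # running estimate of the recovered embedding f^
    for hk, wk in scales:                                 # e.g. [(1, 1), (2, 2), (3, 3), ..., (16, 16)]
        logits = predict_scale(context, out_hw=(hk, wk))  # condition on all coarser scales generated so far
        r_k = torch.distributions.Categorical(logits=logits).sample()  # token map of scale k
        f_hat_k = embed_and_upsample(r_k, hk, wk)         # codebook lookup, upsampled to the final size
        f_hat = f_hat_k if f_hat is None else f_hat + f_hat_k  # accumulate the residual maps into f^
        context.append(r_k)                               # the next scale also conditions on r_1, ..., r_k
    return decode(f_hat)                                  # recovered embedding f^ -> reconstructed image
```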

Based on this understanding, regarding a large-size image with super high resolution, we can set the dimension of the embedding vector f higher for more representation capability.

Considering mainstream techniques like the one in the paper “Scalable Diffusion Models with Transformers”, one technique is to “patchify” the raw image into patches (i.e. tokens) and then find the “best” embedding of each patch via transformer-based learning. Once each token embedding is decoded back into a “predicted” patch, all the “predicted” patches can be reassembled to recover the whole image.
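For concreteness, this is the kind of patchify/unpatchify step I have in mind, sketched in plain PyTorch with an assumed patch size (not code from either paper):

```python
import torch

def patchify(img: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split an image [B, C, H, W] into non-overlapping patch tokens [B, N, C * p * p]."""
    B, C, H, W = img.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image size must be divisible by the patch size"
    x = img.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)                  # [B, H/p, W/p, C, p, p]
    return x.reshape(B, (H // p) * (W // p), C * p * p)

def unpatchify(tokens: torch.Tensor, C: int, H: int, W: int, patch_size: int = 16) -> torch.Tensor:
    """Reassemble the "predicted" patch tokens [B, N, C * p * p] back into an image [B, C, H, W]."""
    B = tokens.shape[0]
    p = patch_size
    x = tokens.reshape(B, H // p, W // p, C, p, p)
    x = x.permute(0, 3, 1, 4, 2, 5)                  # [B, C, H/p, p, W/p, p]
    return x.reshape(B, C, H, W)
```

(In DiT the patchify is applied to the VAE latent rather than the raw image, but the idea is the same.)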

Now, the QUESTION is: can we also do the “patchify” first, apply the fine→coarse→fine process to each patch, and then reassemble the “predicted” patches to recover the whole image?

I'm not quite sure which of the two methods is better: a) setting the dimension of the embedding vector f higher for more representation capability, or b) patchifying the raw image into patches, working on each patch, and then piecing the “predicted” patches together.

One intuition about the “patchify” in method b) is that there could be visible seams when the pieces are put back together if the optimization has not been given enough computation yet. Note that breaking a whole image into pieces actually destroys the spatial relationships among them. Method a) does not need to deal with this piecing-together problem because the embedding covers the whole image.

Best,

Xugang Ye

keyu-tian commented 2 months ago

@judywxy1122 Thank you for your kind words! The question is a bit detailed; let me give it some thought and I'll get back to you shortly.