FoundationVision / VAR

[GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!

The performance of VAR Tokenizer #16

Closed · youngsheen closed this 5 months ago

youngsheen commented 5 months ago

What is the performance of the VAR tokenizer? It is trained on OpenImages, while some other VQGAN tokenizers are trained only on ImageNet. I wonder how much of the performance gain comes from the pre-training data.

keyu-tian commented 5 months ago

hi @youngsheen, more VQVAE evals are coming in the next paper update.

We trained the VQVAE on OpenImages, following VQGAN (see https://github.com/CompVis/taming-transformers?tab=readme-ov-file#overview-of-pretrained-models).

We actually found that training the VQVAE directly on ImageNet yields slightly better results than OpenImages, but we kept using OpenImages to stay aligned with our VQGAN baseline.

luohao123 commented 5 months ago

Is the tokenizer able to do understanding tasks?

huxiaotaostasy commented 5 months ago

I used the VQVAE in VAR and compared the image produced by encoding and decoding against the original image, as shown below. [images: original vs. reconstruction] Is this because the generalization performance of the VQVAE is not good enough?

keyu-tian commented 5 months ago

@huxiaotaostasy please make sure you denormalize and clamp the output of the VQVAE via `out = out.mul(0.5).add_(0.5).clamp_(0, 1)`.
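
For reference, a minimal round-trip sketch. The `load_vqvae` helper and the `encode`/`decode` calls are placeholders for whatever your checkpoint's API actually exposes; the point is the [-1, 1] normalization on the way in and the denormalize-and-clamp on the way out:

```python
import torch
from PIL import Image
from torchvision import transforms

# Placeholder: load your VQVAE checkpoint however your setup does it.
vae = load_vqvae('vqvae_checkpoint.pth').eval()

# Normalize the input image to [-1, 1] to match training-time preprocessing.
to_tensor = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
img = to_tensor(Image.open('input.png').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    # Placeholder encode/decode calls; adapt to the real interface.
    out = vae.decode(vae.encode(img))           # reconstruction, still in [-1, 1]
    # The crucial step: map [-1, 1] back to [0, 1] before saving/displaying,
    # otherwise the reconstruction looks washed out or clipped.
    out = out.mul(0.5).add_(0.5).clamp_(0, 1)

transforms.ToPILImage()(out.squeeze(0)).save('reconstruction.png')
```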

keyu-tian commented 5 months ago

@luohao123 maybe you can create token maps (r1, r2, ..., rK) by repeating one index in [0, V-1] on all scales and then decode them to see what the reconstructed image looks like, as sketched below.
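
A rough sketch of that probe, assuming per-scale token maps and a single decode entry point; `idx_maps_to_img` and the scale list are placeholders to be adjusted to the actual VQVAE interface and config:

```python
import torch

V = 4096                                     # assumed codebook size (vocabulary)
scales = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)   # assumed per-scale token-map side lengths
code = 123                                   # any fixed index in [0, V-1]

# Fill every scale's token map (r1, ..., rK) with the same codebook index.
token_maps = [torch.full((1, s * s), code, dtype=torch.long) for s in scales]

with torch.no_grad():
    # Placeholder decode call: turn the multi-scale index maps into an image.
    img = vae.idx_maps_to_img(token_maps)
    img = img.mul(0.5).add_(0.5).clamp_(0, 1)   # denormalize to [0, 1]
```

Sweeping `code` over the vocabulary then gives a quick visual impression of what each codebook entry decodes to.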