FoundationVision / VAR

[GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
MIT License

FID on Class-Conditioned Evaluation & Normalized Attn in Depth 16/30 #38

Closed lxa9867 closed 5 months ago

lxa9867 commented 5 months ago

Hi Keyu,

Thanks for sharing this interesting work. I have two questions and hope to get some insights from you.

(1) I am trying to evaluate the released checkpoint on the ImageNet eval set. The reproduced FID is about 7 for the depth-16 model. I suspect there is a configuration difference between our testing (we used the configs in the released demo and torchmetrics for evaluation) and the reported experiment. Could you please share the specific parameter settings used for sampling during evaluation? Thanks.
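
For reference, a minimal sketch of the torchmetrics-based evaluation path mentioned above, assuming uint8 image tensors in [0, 255] and hypothetical DataLoaders for real and generated images:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Assumed: `real_loader` yields ImageNet validation images and `fake_loader`
# yields images sampled from the released depth-16 checkpoint, both as
# uint8 tensors of shape [N, 3, 256, 256].
fid = FrechetInceptionDistance(feature=2048)

for batch in real_loader:          # real reference images
    fid.update(batch, real=True)
for batch in fake_loader:          # generated samples
    fid.update(batch, real=False)

print(fid.compute())
```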

(2) I noticed that in a recent update the released checkpoints were replaced, along with the depth-16/30 VAR configs. I wonder what the effect of normalized attention is, and how much this modification improves performance. Thanks.

Best, A loyal reader of your paper

keyu-tian commented 5 months ago

hi @lxa9867 you can use the configurations at https://github.com/FoundationVision/VAR?tab=readme-ov-file#sampling--zero-shot-inference to sample 50k images (50 images per class). To reproduce the FID, you need to use the OpenAI evaluation protocol in https://github.com/openai/guided-diffusion/tree/main/evaluations.
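
A minimal sketch of that workflow: sample 50 images for each of the 1000 ImageNet classes, pack them into an .npz, and score it with the guided-diffusion evaluator. The `var.autoregressive_infer_cfg` call and its hyperparameters mirror the released demo, but treat the exact signature and values (cfg, top_k, top_p) as assumptions, not the reported evaluation settings:

```python
import numpy as np
import torch

samples = []
for cls in range(1000):
    label_B = torch.full((50,), cls, device='cuda')
    with torch.inference_mode():
        # assumed call, following the released demo notebook
        imgs = var.autoregressive_infer_cfg(B=50, label_B=label_B,
                                            cfg=1.5, top_k=900, top_p=0.96)
    imgs = (imgs.permute(0, 2, 3, 1) * 255).clamp(0, 255).to(torch.uint8)
    samples.append(imgs.cpu().numpy())

arr = np.concatenate(samples, axis=0)        # [50000, 256, 256, 3], uint8
np.savez('var_d16_samples.npz', arr_0=arr)

# Then, from the guided-diffusion evaluations directory:
#   python evaluator.py <imagenet256_reference_batch.npz> var_d16_samples.npz
```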

Normalized attention can stabilize training when fp16 is used. If trained with fp32, it should yield similar results.
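
For context, a minimal sketch of what such query/key normalization typically looks like; this is a generic illustration of the idea, not the exact VAR implementation:

```python
import torch
import torch.nn.functional as F

def qk_normalized_attention(q, k, v, scale):
    # L2-normalize queries and keys per head before the dot product, then
    # apply a (possibly learnable) scale. Keeping the attention logits in a
    # bounded range is what helps avoid fp16 overflow in the softmax.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) * scale   # bounded logits
    return attn.softmax(dim=-1) @ v
```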

lxa9867 commented 5 months ago

Got it. Thanks for the quick response!