FoundationVision / VAR

[NeurIPS 2024 Oral][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
MIT License

FID misalignment #45

Closed: ckczzj closed this issue 6 months ago

ckczzj commented 7 months ago

Great work, and thanks for publishing the code!

I encountered some problems when calculating the FID.

Originally I used my own FID calculation code to compute the FID between the images from the ImageNet validation set and their VQVAE autoencoding reconstructions. The result was 0.92, which makes sense. However, when I used the same code to compute the FID between the ImageNet validation images and your d16 class-conditional generated images, the result was 19.13.
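
For readers checking their own FID code, the distance at the core of the metric can be sketched as follows. This is a minimal NumPy sketch, not the repo's evaluator.py and not the code used in this thread: the function name frechet_distance is hypothetical, and it assumes the features (normally Inception activations) have already been extracted into arrays of shape (N, D).

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (N, D) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Two scores are only comparable when they use the same feature extractor and the same reference statistics, so a gap between two FID pipelines does not by itself show that either one is wrong.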

I also used your method to calculate the FID: I used the following code to create an npz file and ran python evaluator.py VIRTUAL_imagenet256_labeled.npz tmp.npz. The result was 18.25. Do you have any idea where I made a mistake?

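import os
import torch
from PIL import Image
from utils.misc import create_npz_from_sample_folder  # assumed location in the VAR repo

# `var`, `device`, and `imagenet_val_dataloader` are assumed to be set up beforehand.
os.makedirs("./tmp", exist_ok=True)
for batch_id, batch in enumerate(imagenet_val_dataloader):
    label = batch["label"]
    # Generate one image per validation label in the batch.
    gen_images = var.autoregressive_infer_cfg(
        B=label.shape[0],
        label_B=label.to(device),
        cfg=1.5,
        top_k=900,
        top_p=0.96,
        more_smooth=False,
    )
    # Convert to uint8 HWC arrays (assumes the model outputs values in [0, 1]).
    gen_images = gen_images.mul(255).add(0.5).clamp(0, 255).permute(0, 2, 3, 1).to('cpu', torch.uint8).numpy()
    for i in range(gen_images.shape[0]):
        Image.fromarray(gen_images[i]).save(os.path.join("./tmp", f"{batch_id}_{i}.png"))

create_npz_from_sample_folder("./tmp")

keyu-tian commented 7 months ago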

@ckczzj could you try again after re-downloading the model weights from HF and using the latest, unmodified code? Also, please load the model state dict with var.load_state_dict(var_ckpt, strict=True).
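
For context, strict=True makes PyTorch's load_state_dict raise an error when the checkpoint keys and the model keys do not match exactly, whereas strict=False tolerates mismatches and can leave some parameters at their initial values. The sketch below mirrors that key comparison on plain collections of names. The helper diff_state_dict_keys is hypothetical and not part of PyTorch or the VAR repo.

```python
def diff_state_dict_keys(model_keys, ckpt_keys):
    """Return (missing, unexpected) key lists, as a strict load would check them."""
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    missing = sorted(model_keys - ckpt_keys)     # expected by the model, absent in the checkpoint
    unexpected = sorted(ckpt_keys - model_keys)  # present in the checkpoint, unknown to the model
    return missing, unexpected
```

In practice one might call it as diff_state_dict_keys(var.state_dict().keys(), var_ckpt.keys()), assuming var_ckpt is a plain state dict. Both lists should be empty for a clean load.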

ckczzj commented 6 months ago

Thanks for replying. Today I re-cloned the repo and re-downloaded the ckpt (and loaded the model with strict=True) to calculate the FID for the d16 model, and the result is still 18.25.

keyu-tian commented 6 months ago

@ckczzj could you save the images into 1000 folders according to their labels (from 0 to 999)? Then visualize label 980 (it should have 50 images) as a quick check.
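
One way to organize the samples for this check is sketched below. This is a minimal sketch under stated assumptions: the helper names class_image_path and count_images_per_class are hypothetical, folders are named by the integer label, and images are saved as PNG files.

```python
import os
from collections import Counter

def class_image_path(root, label, index):
    """Return the save path for one sample, creating its per-label folder if needed."""
    class_dir = os.path.join(root, str(label))
    os.makedirs(class_dir, exist_ok=True)
    return os.path.join(class_dir, f"{index:05d}.png")

def count_images_per_class(root):
    """Count the PNG files in each numerically named subfolder of root."""
    counts = Counter()
    for name in os.listdir(root):
        class_dir = os.path.join(root, name)
        if name.isdigit() and os.path.isdir(class_dir):
            counts[int(name)] = sum(f.endswith(".png") for f in os.listdir(class_dir))
    return counts
```

With a layout like this, each generated image could be saved to class_image_path("./tmp", int(label[i]), n), and count_images_per_class("./tmp") should report 50 images for every label from 0 to 999 after a full run over the validation labels. Opening the folder for label 980 then shows at a glance whether the images match the class they were conditioned on.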

ckczzj commented 6 months ago

I have solved the problem. Thanks for your reply.