cientgu / VQ-Diffusion


Cannot reproduce the FID/IS result of ImageNet? #14

Open JohnDreamer opened 2 years ago

JohnDreamer commented 2 years ago

Hi, it is great work! These days I have trained the model with configs/imagenet.yaml, on 8 GPUs for 100 epochs, but the FID result is only 20, which is far from the 11.89 reported in your paper. I evaluate the model with the following steps:

1. Process all the training images: resize the short edge of each image to 256 and center crop it to 256×256.
2. Sample images, 50 per class, for 50 × 1000 = 50K images in total.
3. Use torch-fidelity to compute FID between the 50K sampled images and the processed training images.

Did I make any mistake? Are 100 epochs enough for training? Could you please provide some details about training on ImageNet, such as the number of epochs? A reply would be greatly appreciated. Thanks!
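For reference, a minimal sketch of the evaluation pipeline described above: resize the short edge to 256, center crop to 256×256, then compute FID (and IS) with torch-fidelity. The directory names and the exact preprocessing choice are assumptions for illustration, not the authors' script.

```python
# Sketch of the FID evaluation described in the comment above.
# Folder names are placeholders; adjust them to your own layout.
import os
from PIL import Image
from torchvision import transforms
import torch_fidelity

preprocess = transforms.Compose([
    transforms.Resize(256),      # resize so the short edge is 256
    transforms.CenterCrop(256),  # center crop to 256x256
])

def preprocess_folder(src_dir, dst_dir):
    """Apply resize + center-crop to every image in src_dir and save to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        img = Image.open(os.path.join(src_dir, name)).convert('RGB')
        preprocess(img).save(os.path.join(dst_dir, name))

# FID between the 50K generated samples and the preprocessed training images.
metrics = torch_fidelity.calculate_metrics(
    input1='RESULT',              # generated images
    input2='imagenet_train_256',  # preprocessed training images
    cuda=True,
    fid=True,
    isc=True,                     # also report Inception Score
)
print(metrics)
```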

cientgu commented 2 years ago

Sorry for the late reply. I think you should notice two important things. First, in the sampling process the truncation rate is extremely important; we searched for the best truncation rate and selected 0.869 for our model when testing FID. Second, we followed previous works such as GLIDE and VQGAN and calculate FID between 50K generated images and all the training images. Personally I don't think it's a good evaluation metric, but it does achieve a lower FID score.
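To make the role of the truncation rate concrete, here is an illustrative sketch (not the repository's exact code) of how such a rate is typically applied when sampling discrete tokens: keep only the highest-probability tokens whose cumulative probability stays within the rate, renormalize, and sample from the truncated distribution.

```python
# Illustrative truncated (top-p style) sampling over predicted token logits.
# This is an assumption about the mechanism, shown for intuition only.
import torch

def truncated_sample(logits, truncation_rate=0.869):
    """logits: (..., vocab_size) predicted token logits for one sampling step."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens whose cumulative probability (before adding themselves)
    # is still below the truncation rate; always keep the most likely token.
    keep = cumulative - sorted_probs < truncation_rate
    keep[..., 0] = True
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    flat = sorted_probs.reshape(-1, sorted_probs.size(-1))
    sampled = torch.multinomial(flat, 1).reshape(*sorted_probs.shape[:-1], 1)
    return torch.gather(sorted_idx, -1, sampled).squeeze(-1)
```

A higher truncation rate admits more low-probability tokens (more diversity), a lower one concentrates sampling on the most likely tokens, which is why small changes such as 0.86 vs. 0.869 can shift FID.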

JohnDreamer commented 2 years ago


Thanks for the reply.

1. The truncation rate I use is the default setting, 0.86. Will that cause a big difference compared with 0.869?
2. The training script you provide shows 100 training epochs; is that enough? And how many GPUs do you use for training on ImageNet?
3. For processing the training set when computing FID, which do you use: (a) only resize each image to 256×256, or (b) resize the short edge of the image to 256 and center crop it to 256×256? (Both options are written out in the sketch below.)
4. How do you sample the test images? Do you sample the same number of images (50) for each class?

I really want to figure out the key point causing the result difference. A reply would be greatly appreciated. Thanks!
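The two candidate preprocessing options from point 3, written out with torchvision for clarity (a sketch, not the repository's own preprocessor code):

```python
from torchvision import transforms

# (a) stretch every image directly to 256x256, ignoring aspect ratio
preprocess_a = transforms.Resize((256, 256))

# (b) resize the short edge to 256, then center crop to 256x256
preprocess_b = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
])
```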

cientgu commented 2 years ago

1. I think it will only make a slight difference.
2. We only trained it for 100 epochs; I'm not sure if more epochs will improve the performance.
3. We should have used "ImageNetTransformerPreprocessor" from https://github.com/cientgu/VQ-Diffusion/blob/main/image_synthesis/data/utils/image_preprocessor.py, but we used "DalleTransformerPreprocessor". This seems to be our mistake, sorry about it, and currently we don't know how much it affects the results.
4. Yes, you are right.

Besides, our follow-up work "Improved VQ-Diffusion" greatly improves the performance on ImageNet, and we have released the pretrained model; maybe it can help you.

guyuchao commented 2 years ago

I met the same problem. I tested the pre-trained checkpoint from "Improved VQ-Diffusion" and only achieved an FID of 20.4958. I used the script provided in inference_VQ_Diffusion.py in microsoft/VQ-diffusion:

```python
VQ_Diffusion_model = VQ_Diffusion(
    config='OUTPUT/pretrained_model/config_imagenet.yaml',
    path='OUTPUT/pretrained_model/imagenet_pretrained.pth',
)
VQ_Diffusion_model.inference_generate_sample_with_class(
    407, truncation_rate=0.86, save_root="RESULT", batch_size=4,
)
```
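For a full FID evaluation, the same call presumably has to be repeated for every ImageNet class. A sketch of that loop, reusing the model object from the snippet above; the per-class save_root layout and the assumption that each call produces batch_size images are mine, not from the repository:

```python
# Sketch: generate 50 images for each of the 1000 ImageNet classes for FID.
for class_id in range(1000):
    # 50 images per class, in two batches of 25 (assuming one image per batch slot)
    for _ in range(2):
        VQ_Diffusion_model.inference_generate_sample_with_class(
            class_id,
            truncation_rate=0.86,
            save_root=f"RESULT/class_{class_id:04d}",  # hypothetical layout
            batch_size=25,
        )
```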