TencentARC / T2I-Adapter

Reproduce Issue on COCO 2017 Validation Set #62

Open AlonzoLeeeooo opened 1 year ago

AlonzoLeeeooo commented 1 year ago

Hi,

Thanks for the nice work! However, when evaluating the model you provide on Hugging Face, I found that I could not reproduce the reported FID and CLIP scores. The quantitative results I get for PITI and Stable Diffusion are also worse than the reported ones. Currently, I am using the hyperparameter settings that you provide in the tutorial of your GitHub repo. Could you please share the hyperparameter settings you used for evaluation on the COCO 2017 validation set? Thanks in advance!

Best regards, Chang

P.S. I am using https://github.com/mseitzer/pytorch-fid to compute the FID score and the torchmetrics implementation to compute the CLIP score.
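
For reference, a minimal sketch of how I run the two metrics is shown below; the directory layout, the CLIP backbone, and the example pair are placeholders from my own setup, not the settings used in the paper.

```python
# Sketch only: assumed paths and CLIP backbone, not the paper's configuration.
import torch
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor
from pytorch_fid.fid_score import calculate_fid_given_paths
from torchmetrics.multimodal.clip_score import CLIPScore

device = "cuda" if torch.cuda.is_available() else "cpu"

# FID between the 5k real COCO val2017 images and the 5k generated samples.
fid = calculate_fid_given_paths(
    ["coco2017/val2017", "outputs/generated"],  # assumed directory layout
    batch_size=50,
    device=device,
    dims=2048,
)

# CLIP score via torchmetrics: compute() returns the mean over all updated
# pairs on a 0-100 scale, so divide by 100 to compare with values like 0.26.
clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16").to(device)
pairs = [  # hypothetical example; in practice the 5k (generated image, caption) pairs
    ("outputs/generated/000000000139.png", "a person riding a bike down a street"),
]
for img_path, caption in pairs:
    image = pil_to_tensor(Image.open(img_path).convert("RGB")).to(device)
    clip_metric.update([image], [caption])

print("FID:", fid)
print("CLIP score:", clip_metric.compute().item() / 100)
```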

MC-E commented 1 year ago

Hi, we use OpenCLIP to calculate the CLIP score. You can refer to https://github.com/TencentARC/T2I-Adapter/issues/57

AlonzoLeeeooo commented 1 year ago

> Hi, we use OpenCLIP to calculate the CLIP score. You can refer to #57

Hi @MC-E ,

Thanks for replying. I will try reproducing those metrics again.
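
For reference, my open_clip attempt looks roughly like the sketch below; the ViT-H-14 / laion2b_s32b_b79k checkpoint is only my own guess, hence the follow-up question about which version you use.

```python
# Per-pair CLIP score with open_clip (cosine similarity of normalized
# image/text embeddings); the backbone and checkpoint below are guesses.
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k", device=device
)
model.eval()
tokenizer = open_clip.get_tokenizer("ViT-H-14")

@torch.no_grad()
def clip_score(image_path: str, caption: str) -> float:
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    text = tokenizer([caption]).to(device)
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (image_feat @ text_feat.T).item()  # comparable to values like 0.26

print(clip_score("outputs/generated/000000000139.png", "a person riding a bike down a street"))
```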

Best,

AlonzoLeeeooo commented 1 year ago

> Hi, we use OpenCLIP to calculate the CLIP score. You can refer to #57

Hi @MC-E ,

By the way, could you please tell me which version of the OpenCLIP model you are using? Is it ViT-H-14, ViT-B-32, or another one? Thanks so much!

Best,

AlonzoLeeeooo commented 1 year ago

Hi @MC-E ,

Thank you for providing your evaluation code. I have re-run the evaluation on the COCO 2017 validation set, and the resulting CLIP score is slightly better than before, but the results are still worse than the reported ones: the FID is 21.72 and the CLIP score is 0.2597. Besides, I also find that the CLIP score of SD (v1.4) is higher than the one you report: 0.2673 compared to 0.2648. For the hyperparameters, I follow the recommended setup in your GitHub repo instructions. Could you please share the parameter settings you used for the evaluation on the COCO 2017 validation set? Or is there any parameter I am not configuring correctly?

Thank you in advance for replying despite your busy schedule. Hope everything goes well with you!

Best,

ShihaoZhaoZSH commented 1 year ago

@AlonzoLeeeooo There are 5k images and 30k (image, text) pairs in COCO val2017. How many (image, text) pairs do you test on: 5k or 30k?

AlonzoLeeeooo commented 1 year ago

> @AlonzoLeeeooo There are 5k images and 30k (image, text) pairs in COCO val2017. How many (image, text) pairs do you test on: 5k or 30k?

Hi @ShihaoZhaoZSH ,

As reported in the paper, the evaluation is performed on the validation set with 5k images. Note that this 5k-image validation set also has official caption annotations.

Best,

ShihaoZhaoZSH commented 1 year ago

> As reported in the paper, the evaluation is performed on the validation set with 5k images. Note that this 5k-image validation set also has official caption annotations.

Thanks for your reply. There are 6 captions for each image in the validation set, so do you run all 6×5k = 30k text-image pairs, or just randomly pick one caption per image and test on 5k text-image pairs?

AlonzoLeeeooo commented 1 year ago

> Thanks for your reply. There are 6 captions for each image in the validation set, so do you run all 6×5k = 30k text-image pairs, or just randomly pick one caption per image and test on 5k text-image pairs?

For each image, I randomly pick one caption as the corresponding text prompt. For a public dataset at this scale, that should be sufficient to measure the generation ability of the evaluated model properly.
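
For reference, a minimal sketch of how I build the 5k (image, caption) pairs with pycocotools; the annotation path and the fixed random seed are just my own choices.

```python
# Build one (image, caption) pair per COCO val2017 image by randomly picking
# a single caption per image; the path and seed are illustrative assumptions.
import random
from pycocotools.coco import COCO

random.seed(0)  # fixed seed so the same 5k pairs are sampled every run
coco = COCO("annotations/captions_val2017.json")

pairs = []
for img_id in coco.getImgIds():
    img_info = coco.loadImgs(img_id)[0]
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
    caption = random.choice(anns)["caption"]
    pairs.append((img_info["file_name"], caption))

print(len(pairs))  # 5000: one caption per validation image
```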

ShihaoZhaoZSH commented 1 year ago

> For each image, I randomly pick one caption as the corresponding text prompt. For a public dataset at this scale, that should be sufficient to measure the generation ability of the evaluated model properly.

Got it. I met the same problem as you: my reproduced FID results are also greater than 20 on both the seg+text and sketch+text settings.

AlonzoLeeeooo commented 1 year ago

> Got it. I met the same problem as you: my reproduced FID results are also greater than 20 on both the seg+text and sketch+text settings.

What is your reproduced CLIP score? The reproduced FID could still be reasonable given possible fluctuations across devices, but my reproduced CLIP score is clearly worse than the reported one: 0.2597 (reproduced) compared to 0.2673 (reported).

ShihaoZhaoZSH commented 1 year ago

> What is your reproduced CLIP score? The reproduced FID could still be reasonable given possible fluctuations across devices, but my reproduced CLIP score is clearly worse than the reported one: 0.2597 (reproduced) compared to 0.2673 (reported).

Sorry, for the main line I used the Anything pretrained weights. Now I have switched to the Stable Diffusion weights and the FID is lower than 20, but the CLIP score is still lower than 0.26.

YibooZhao commented 12 months ago

I have a stupid question. When calculating the CLIP score, is it right to calculate the CLIP scores of all COCO 2017 val image-text pairs and then average them? And what negative_prompt is required for generation?

dmmSJTU commented 10 months ago

@AlonzoLeeeooo Hi, could you please share your WeChat so I can ask some questions about training? My email is dmm2020@sjtu.edu.cn.

AlonzoLeeeooo commented 10 months ago

> @AlonzoLeeeooo Hi, could you please share your WeChat so I can ask some questions about training? My email is dmm2020@sjtu.edu.cn.

I have sent my WeChat number to you via email.

AlonzoLeeeooo commented 10 months ago

> I have a stupid question. When calculating the CLIP score, is it right to calculate the CLIP scores of all COCO 2017 val image-text pairs and then average them? And what negative_prompt is required for generation?

I remember there is an issue that describes this setting (though I am not sure whether it uses the first caption of each image). You may want to check it.
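
As for the averaging, my understanding (just my own assumption, not a confirmed protocol) is that you score each (generated image, caption) pair once and report the mean, e.g. with torchmetrics' functional clip_score:

```python
# Average per-pair CLIP scores over all (generated image, caption) pairs;
# the backbone and file names below are assumptions.
import torch
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor
from torchmetrics.functional.multimodal import clip_score

pairs = [  # in practice, the 5k COCO val2017 (generated image, caption) pairs
    ("outputs/generated/000000000139.png", "a person riding a bike down a street"),
]

scores = []
for img_path, caption in pairs:
    image = pil_to_tensor(Image.open(img_path).convert("RGB"))
    # returns 100 * cosine similarity for this pair (it reloads CLIP each call, so it is slow)
    scores.append(clip_score(image, caption, model_name_or_path="openai/clip-vit-base-patch16"))

print("mean CLIP score:", torch.stack(scores).mean().item() / 100)  # /100 to match the ~0.26 scale
```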