Karine-Huang / T2I-CompBench

[Neurips 2023] T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
https://arxiv.org/pdf/2307.06350.pdf
MIT License

Question on reproducing the results #10

Closed TianyunYoung closed 7 months ago

TianyunYoung commented 7 months ago

Hi~

Sorry to bother you. I tried to reproduce the results in Table 2 and Table 3 using the released weights in ./GORS_finetune/checkpoint. However, my results do not seem to match the reported ones.

My reproduction procedure is as follows:

  1. Use GORS_finetune/inference_eval.py to generate images for the prompts in "color_val.txt", "shape_val.txt", and "texture_val.txt".
  2. Use BLIPvqa_eval/BLIP_vqa.py to calculate VQA results on the generated images.
  3. Calculate the averaged score in "annotation_blip/vqa_result.json", which contains 3000 scores per test file.

Is the procedure correct? Thank you very much~
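Step 3 above can be sketched as follows. This is a minimal, hypothetical illustration: it assumes the result file is a JSON list of entries whose "answer" field holds the per-image score as a string, which may differ from the actual BLIPvqa_eval output schema.

```python
import json
from pathlib import Path


def average_vqa_score(result_path):
    """Average the per-image scores in a VQA result file.

    Assumption (not confirmed by the repo): the file is a JSON list
    like [{"question_id": 0, "answer": "0.85"}, ...], with the score
    stored as a string in the "answer" field.
    """
    entries = json.loads(Path(result_path).read_text())
    scores = [float(entry["answer"]) for entry in entries]
    return sum(scores) / len(scores)
```

For example, `average_vqa_score("annotation_blip/vqa_result.json")` would average the 3000 scores for one test file.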

Karine-Huang commented 7 months ago

Hello~

Not a bother at all! Your procedure seems generally correct, but there might be a couple of things to double-check:

Ensure that you've replaced the "pretrained_model_path" with the correct path to the checkpoint (ckpt) in the inference code.
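A quick way to fail fast on a wrong path is to validate it before running inference. This is a small stdlib-only sketch; the expected checkpoint contents are an assumption, only the path comes from this thread.

```python
from pathlib import Path


def check_checkpoint(pretrained_model_path):
    """Raise early if the checkpoint path is missing or empty.

    `pretrained_model_path` should point at the released weights,
    e.g. "./GORS_finetune/checkpoint". What files must be inside
    is an assumption here; this only checks the directory is usable.
    """
    path = Path(pretrained_model_path)
    if not path.exists():
        raise FileNotFoundError(f"checkpoint path not found: {path}")
    if path.is_dir() and not any(path.iterdir()):
        raise ValueError(f"checkpoint directory is empty: {path}")
    return path
```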

It's essential to note that variations in servers, graphics cards, and drivers may introduce some deviations in results. As long as the discrepancies are within a reasonable range, they can be considered acceptable.

Feel free to let me know if you have further questions!

Karine-Huang commented 7 months ago

Hello! Thanks for bringing this to our attention. The version of 'diffusers' can impact result reproduction. To ensure consistent results, we recommend pinning 'diffusers' to version 0.15.0.dev0.

You can switch to this version with the following command: `pip install diffusers==0.15.0.dev0`

or install from the provided source of diffusers.zip

For more detailed instructions, please refer to the Readme.md file. Hope that helps!
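Since the installed version matters so much here, it can be worth asserting it before running evaluation. A stdlib-only sketch (the helper name is mine, not from the repo):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Example guard before evaluation (version string taken from this thread):
# assert installed_version("diffusers") == "0.15.0.dev0"
```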

TianyunYoung commented 7 months ago

Thanks for your kind and detailed response. I will try as you recommend. 😊

TianyunYoung commented 7 months ago

Hi~ Thanks for your guidance.

After using the provided diffusers.zip, the reproduced results do match the reported results 😊!

It is still a little strange that the 'diffusers' version would influence the results that much. At the beginning, the 'diffusers' version I used was 0.24.0, and the B-VQA result for color_val.txt was about 0.55. After changing the version to 0.15.0.dev0, the result improved a lot, to 0.66.

Do you have any clue about it? Thank you very much~

Karine-Huang commented 7 months ago

Hello! Variation in results across different package versions, including diffusers, is not uncommon in software development. Here are some possible reasons:

  1. Diffusers depends on other Python packages, and changes in the versions of those dependencies can also affect its overall behavior.
  2. The code within the diffusers package itself may have changed between versions 0.24.0 and 0.15.0.dev0, and those changes could have affected the results.

It's worth noting that keeping the version consistent with the one used for training is reasonable, as it helps control for variations introduced by different versions.

TianyunYoung commented 7 months ago

Thanks for your reply. The explanation is reasonable.