How to use this benchmark to evaluate other models, such as SDXL and SD3-medium?

Karine-Huang commented 1 week ago

To use this benchmark to evaluate other models, such as SDXL and SD3-medium, follow these steps:

Generate Images:

Generate 10 images for each prompt across all categories.
The inference_eval.py script shows how to generate the images with a fixed seed and save them under the expected file names. Example directory structure for saving the images:

color/samples/
    ├── a green bench and a blue bowl_000000.png
    ├── a green bench and a blue bowl_000001.png
    └──...
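
A minimal sketch of the naming scheme shown above, not the actual code from inference_eval.py. It assumes the suffix is a six-digit, zero-padded per-prompt counter and that each image uses the seed `base_seed + index`. The names `plan_outputs`, `generate_all`, and `generate_image` are hypothetical; `generate_image` stands in for whatever model call you use (SDXL, SD3-medium, ...) and is assumed to return PNG bytes.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def plan_outputs(
    prompts: Iterable[str],
    out_dir: str,
    images_per_prompt: int = 10,
    base_seed: int = 42,  # assumption: any fixed value works, it only has to stay constant
) -> List[Tuple[str, int, Path]]:
    """Return (prompt, seed, output path) for every image to generate."""
    plan = []
    for prompt in prompts:
        for i in range(images_per_prompt):
            # e.g. "a green bench and a blue bowl_000000.png"
            path = Path(out_dir) / f"{prompt}_{i:06d}.png"
            plan.append((prompt, base_seed + i, path))
    return plan


def generate_all(
    prompts: Iterable[str],
    out_dir: str,
    generate_image: Callable[[str, int], bytes],  # hypothetical model wrapper
    images_per_prompt: int = 10,
    base_seed: int = 42,
) -> List[Path]:
    """Generate and save every planned image, returning the written paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for prompt, seed, path in plan_outputs(prompts, out_dir, images_per_prompt, base_seed):
        path.write_bytes(generate_image(prompt, seed))
        written.append(path)
    return written
```

Because the seed depends only on `base_seed` and the image index, rerunning the script with the same model should write the same set of files under the same names.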

Evaluation:

Refer to the "Evaluation" section under "Example usage" in the README to evaluate the model's performance on each category.
Use the following tools for evaluation:
- BLIP-VQA for attribute binding (color, shape, texture).
- UniDet for spatial relationships (2D and 3D) and numeracy.
- CLIPScore for non-spatial relationships.
- 3-in-1 for complex compositions.
- MLLM evaluation for all categories (optional).
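
The list above can be written down as a small lookup table, which is handy if you script the evaluation over several categories. This is only a sketch: the category keys (`spatial_2d`, `non_spatial`, and so on) are hypothetical names chosen for illustration, not identifiers taken from the repository, and the optional MLLM evaluation is left out because it applies to every category.

```python
# Category -> evaluation tool, following the list above.
# The dictionary keys are hypothetical labels, not names from the repo.
EVALUATOR_BY_CATEGORY = {
    "color": "BLIP-VQA",
    "shape": "BLIP-VQA",
    "texture": "BLIP-VQA",
    "spatial_2d": "UniDet",
    "spatial_3d": "UniDet",
    "numeracy": "UniDet",
    "non_spatial": "CLIPScore",
    "complex": "3-in-1",
}


def evaluator_for(category: str) -> str:
    """Return the name of the tool used to score the given category."""
    try:
        return EVALUATOR_BY_CATEGORY[category]
    except KeyError:
        known = ", ".join(sorted(EVALUATOR_BY_CATEGORY))
        raise ValueError(f"unknown category {category!r}; expected one of: {known}") from None
```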

I hope this helps! Let me know if you need further assistance.

YuehengLuo commented 6 days ago

Hi, I would like to know how to evaluate the various metrics for a generative model such as SD1.5. Should I use color_val.txt to generate 3000 images and then run bash BLIPvqa_eval/test.sh to get the attribute-binding color score? And for the attribute-binding shape score, should the images be generated from Shape_val.txt? In other words, to reproduce each metric, I should generate the test images from the corresponding val.txt file, right?

Karine-Huang commented 5 days ago

Yes, you are correct. To compute each metric for a generative model, generate the test images from the val.txt file of the corresponding category (for example, color_val.txt for the color score).
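
As a quick sanity check before running the evaluation, something like the sketch below can confirm the expected number of images. It assumes each val.txt file holds one prompt per line; `load_prompts` and `expected_image_count` are hypothetical helper names. With 10 images per prompt, the 3000 images mentioned above would correspond to 300 prompts.

```python
from pathlib import Path
from typing import List


def load_prompts(val_file: str) -> List[str]:
    """Read prompts from a val.txt file, assuming one prompt per line."""
    lines = Path(val_file).read_text(encoding="utf-8").splitlines()
    # Skip blank lines so a trailing newline does not count as a prompt.
    return [line.strip() for line in lines if line.strip()]


def expected_image_count(prompts: List[str], images_per_prompt: int = 10) -> int:
    """Number of images the samples folder should contain for these prompts."""
    return len(prompts) * images_per_prompt
```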

YuehengLuo commented 5 days ago

Thank you. I get it!

Karine-Huang / T2I-CompBench, issue #18