Karine-Huang / T2I-CompBench

[NeurIPS 2023] T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
https://arxiv.org/pdf/2307.06350.pdf
MIT License

Question about scores reported in Table 1, Table 2, and Table 3 of the paper #7

Closed JamesSand closed 7 months ago

JamesSand commented 9 months ago

Hey,

Thanks for the great work. I have a question about the numbers in Table 1, Table 2, and Table 3 of the paper, namely the color, shape, and texture attribute-binding scores evaluated with the BLIP-VQA model.

Could you please tell me how to reproduce these numbers? Which prompts should I use, and how many images should I generate for each prompt?

I have tried using all 1,000 prompts provided in color.txt, shape.txt, and texture.txt respectively, generating 1 image for each prompt, and then running the evaluation shell script to calculate the scores. However, the scores I got were 3 points higher than those reported in the paper, and I can't figure out where I went wrong.

I have read the appendix and the supplementary materials of the paper, but I didn't find any clue on how to reproduce the scores.

I have also tried hard to find the email address of the first author of this paper, but failed.

Could anyone help me with this?

Karine-Huang commented 7 months ago

Thank you for your interest in our work! Regarding evaluation, Section 3 of the paper states: "We generate 1,000 text prompts (700 for training and 300 for testing) for each sub-category, resulting in 6,000 compositional text prompts in total." This is a 7:3 split into training and testing sets, and the evaluation is performed on the testing set. The testing set corresponds to the dataset files with the suffix "_val.txt", while the files with the suffix "_train.txt" constitute the training set.
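For example, loading the testing prompts for one sub-category could look like the minimal sketch below. The `examples/dataset/` path is an assumption based on the repository layout; point it at wherever the `*_val.txt` files live in your checkout.

```python
# Minimal sketch: load the validation (testing) prompts for the "color"
# sub-category. The "examples/dataset/" path is an assumption; adjust it
# to match your local copy of the dataset files.
with open("examples/dataset/color_val.txt") as f:
    val_prompts = [line.strip() for line in f if line.strip()]

print(len(val_prompts))  # expected: 300 testing prompts per sub-category
```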

In Section 6.2 of the paper, it is stated: "We generate 10 images for each text prompt in T2I-CompBench for automatic evaluation." This means evaluating the testing set for each of the six categories (300 prompts/category × 6 categories). To reduce randomness in generation, we produce 10 images for each prompt with different seeds. Therefore, for each category, a total of 300 × 10 = 3,000 images are generated for evaluation.
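As a rough sketch of this protocol (not the exact script we used), generation with a `diffusers` Stable Diffusion checkpoint as a stand-in for the model under evaluation could look like the following. The checkpoint name, the output directory, the file naming, and seeds 0–9 are all assumptions for illustration; the paper only says "different seeds", so check the repository's evaluation scripts for the exact image naming convention they expect.

```python
# Minimal sketch: generate 10 seeded images for each testing prompt of one
# sub-category. Model checkpoint, paths, naming, and seed choice are
# illustrative assumptions, not the exact setup used in the paper.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the 300 testing prompts for the "color" sub-category (path assumed).
with open("examples/dataset/color_val.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

os.makedirs("outputs/color", exist_ok=True)
for prompt in prompts:
    for seed in range(10):  # 10 images per prompt to reduce sampling randomness
        generator = torch.Generator("cuda").manual_seed(seed)
        image = pipe(prompt, generator=generator).images[0]
        image.save(f"outputs/color/{prompt}_{seed:06d}.png")
```

The corresponding evaluation script for the category is then run over these 3,000 images to obtain the reported score.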

If you have any further questions or need additional discussion, please feel free to contact us anytime. (We have updated the email address in the paper.)