Karine-Huang / T2I-CompBench

[NeurIPS 2023] T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
https://arxiv.org/pdf/2307.06350.pdf
MIT License

Dense Captioning Model for Attribute Binding Eval? #2

Open yinanyz opened 1 year ago

yinanyz commented 1 year ago

Thanks for the great work! I noticed that in the paper you mentioned that

"We observe that the major limitation of the BLIP-CLIP evaluation is that the BLIP captioning models do not always describe the detailed attributes of each object. For example, the BLIP captioning model might describe an image as “A room with a table, a chair, and curtains”, while the text prompt for generating this image is “A room with yellow curtains and a blue chair”. So explicitly comparing the text-text similarity might cause ambiguity and confusion."

I'm wondering if you've considered using a dense captioning model (e.g. GRiT, which I believe is what LLMScore uses: "GRiT is a model pre-trained with detection and dense caption objectives jointly on the Visual Genome dataset, which contains local fine-grained descriptions for objects in the image.").

Since you mentioned that "We empirically find that the Disentangled BLIP-VQA works best for attribute binding evaluation, UniDet-based metric works best for spatial relationship evaluation", I wonder whether you think a model like GRiT could lead to a better or more unified evaluation metric here?
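For comparison, my rough understanding of the Disentangled BLIP-VQA metric mentioned above is sketched below: split the prompt into one question per attribute-object phrase and score each independently with a VQA model. The hand-written question split and the binary answer check (instead of the probability-of-"yes" scoring in the official code) are simplifications on my part.

```python
# Rough sketch of the Disentangled BLIP-VQA idea: one yes/no question per
# attribute-object phrase, scored independently. Simplifications: questions
# are passed in by hand, and the generated answer replaces the official
# probability-of-"yes" scoring.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

def disentangled_vqa_score(image_path: str, noun_phrases: list) -> float:
    image = Image.open(image_path).convert("RGB")
    scores = []
    for phrase in noun_phrases:          # e.g. ["yellow curtains", "a blue chair"]
        question = f"{phrase}?"          # each attribute-object pair asked on its own
        inputs = vqa_processor(image, question, return_tensors="pt")
        answer_ids = vqa.generate(**inputs)
        answer = vqa_processor.decode(answer_ids[0], skip_special_tokens=True)
        scores.append(1.0 if answer.strip().lower() == "yes" else 0.0)
    return sum(scores) / len(scores)
```

A GRiT-based variant would presumably replace the per-question VQA calls with matching the prompt's attribute phrases against region-level captions, which is what my question is getting at.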

Karine-Huang commented 11 months ago

Thank you for the kind words! While dense captioning models like GRiT excel at describing objects with rich sentences, they have difficulty attending to the specific attributes that matter to us. For instance, given a prompt with a texture description such as "a metallic desk lamp and a fluffy sweater," a dense captioning model may give a detailed account of the generated image (describing almost every object in it), yet fall short of using texture-related terms when describing it.
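To make this concrete, here is a purely hypothetical check (not tied to GRiT's actual API) that matches a prompt's attribute phrases against a set of region captions; detailed captions that omit the texture words give no signal:

```python
# Toy illustration (hypothetical; not GRiT's API): detailed region captions
# that never mention the texture words contribute nothing to a text-side match.
def attribute_recall(region_captions, attribute_phrases):
    """Fraction of prompt attribute phrases found in any region caption."""
    text = " ".join(c.lower() for c in region_captions)
    hits = sum(1 for phrase in attribute_phrases if phrase.lower() in text)
    return hits / len(attribute_phrases)

captions = ["a silver desk lamp on a wooden table", "a sweater draped over a chair"]
print(attribute_recall(captions, ["metallic desk lamp", "fluffy sweater"]))  # 0.0
```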

yinanyz commented 11 months ago

Thanks for your reply! It makes sense that dense captioning models might still miss the descriptive words for texture attributes, but I'm wondering whether you think they would work for color attributes?

Karine-Huang commented 7 months ago

Thank you for the follow-up question! Dense captioning models, while they may be capable of recognizing color attributes, don't consistently generate color-related descriptive words. Moreover, they might provide detailed information on aspects we are not particularly concerned with, leading to a potential mismatch with the specific attributes we are focusing on. Fine-tuning such a model to better emphasize color-related features could be a promising way to address this issue.