Closed: SunzeY closed this issue 8 months ago
Hi @SunzeY, this is a great question. Generally I agree with you; however, we found in our early testing (using all crops whose captions fit under 77 tokens, including these) that existing CLIP-based models generally performed notably worse on captions for smaller crops of the image.
As such, we chose to filter out masks below that size threshold; their captions end up being included as context for the summary of their parent mask instead.
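To make the filtering and parent-context behavior concrete, here is a minimal sketch. The `Mask` class, its field names, and the exact 224-pixel check are illustrative assumptions about the dataset schema, not the repository's actual API.

```python
# Hedged sketch of the size-threshold filtering described above.
# Mask, its fields, and MIN_SIDE are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

MIN_SIDE = 224  # crops smaller than this were filtered out

@dataclass
class Mask:
    bbox: tuple                 # (x, y, width, height)
    caption: str
    quality_ok: bool = True     # False when SAM flagged the mask as low quality
    children: List["Mask"] = field(default_factory=list)

def summarizable(mask: Mask) -> bool:
    """A mask gets its own summary only if its crop is large enough
    and it was not marked as a SAM failure."""
    _, _, w, h = mask.bbox
    return mask.quality_ok and w >= MIN_SIDE and h >= MIN_SIDE

def context_for_summary(mask: Mask) -> List[str]:
    """Captions of filtered-out child masks are folded into the
    parent's summary context instead of being dropped."""
    texts = [mask.caption]
    for child in mask.children:
        if not summarizable(child):
            texts.append(child.caption)
    return texts
```

This mirrors the filtering decision: small or low-quality masks never get their own summary, but their human-written captions still inform the parent's summary.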
Masks marked as low quality are a signal about a failure of the SAM model, rather than something we expect a model performing SCM to handle.
One possible extension of our dataset (moving away from sDCI) would be to have models that can process <full image, pixel mask, caption> triples; however, this wasn't particularly useful for evaluating today's models.
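To illustrate what an evaluation over such triples might look like, here is a hedged sketch of an SCM-style matching metric. The `Triple` fields and the `score` callable are hypothetical, not an API from this repository: `score` stands in for any model that rates how well a caption describes a masked region of the full image.

```python
# Sketch of evaluating a model on <full image, pixel mask, caption>
# triples with an SCM-style matching task. All names are assumptions.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Triple:
    image: str    # path or id of the full image (not a crop)
    mask: str     # pixel mask, e.g. run-length encoded
    caption: str

def scm_accuracy(
    triples: Sequence[Triple],
    score: Callable[[str, str, str], float],
) -> float:
    """For each masked region, the model must pick its caption out of
    all candidate captions; returns top-1 accuracy."""
    captions = [t.caption for t in triples]
    correct = 0
    for t in triples:
        scores = [score(t.image, t.mask, c) for c in captions]
        if captions[scores.index(max(scores))] == t.caption:
            correct += 1
    return correct / len(triples)
```

The key difference from crop-based SCM is that the model sees the full image plus a mask, so small regions keep their surrounding context.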
Thank you for your reply! I totally agree with you, and I believe improving region recognition, especially on small objects, is a good research topic for the future. Do you have plans to officially make summaries for all masks available? By the way, I believe the recent work TAP could perform quite well in your evaluation.
We won't produce summaries for the smaller masks, but you can use the underlying dataset to use the human annotated text label instead.
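In practice, falling back to the human-annotated label is a one-liner. The field names `summary` and `label` below are assumptions about the per-mask record layout, not the dataset's confirmed schema.

```python
# Illustrative sketch: prefer the generated summary when it exists,
# otherwise fall back to the raw human-annotated text label.
# The keys "summary" and "label" are assumed field names.
def text_for_mask(mask: dict) -> str:
    """Return the best available text for a mask record."""
    return mask.get("summary") or mask["label"]
```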
TAP seems pretty relevant; it would definitely make sense to test it in the triples setting I outlined above.
Thank you for your kind reply! :)
Hi, I checked the dataset. Each image has around 30-40 masks with a long caption; however, only masks with a bounding box larger than [224, 224] and a good-quality tag have a summary, and only these masks are tested in your evaluation. Could you please explain why this is the case? I believe it would be closer to "Dense", as claimed in the paper, to test the proposed SCM on matching all possible masks of an image with their captions.