Closed: SunzeY closed this issue 8 months ago
Hi @SunzeY, this is a great question. Generally I agree with you; however, we found in our early testing (using all crops whose captions fit under 77 tokens, including these) that existing CLIP-based models generally performed notably worse on captions for smaller crops of the image.
As such, we chose to filter out masks below that size threshold; their captions end up being included as context for the summary of their parent mask instead.
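To make the filtering and parent-context behavior concrete, here is a minimal sketch. The `Mask` class, its field names, and the exact 224-pixel check are illustrative assumptions about the dataset schema, not the repository's actual API.

```python
# Hedged sketch of the size-threshold filtering described above.
# Mask, its fields, and MIN_SIDE are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

MIN_SIDE = 224  # crops smaller than this were filtered out

@dataclass
class Mask:
    bbox: tuple                 # (x, y, width, height)
    caption: str
    quality_ok: bool = True     # False when SAM flagged the mask as low quality
    children: List["Mask"] = field(default_factory=list)

def summarizable(mask: Mask) -> bool:
    """A mask gets its own summary only if its crop is large enough
    and it was not marked as a SAM failure."""
    _, _, w, h = mask.bbox
    return mask.quality_ok and w >= MIN_SIDE and h >= MIN_SIDE

def context_for_summary(mask: Mask) -> List[str]:
    """Captions of filtered-out child masks are folded into the
    parent's summary context instead of being dropped."""
    texts = [mask.caption]
    for child in mask.children:
        if not summarizable(child):
            texts.append(child.caption)
    return texts
```

This mirrors the filtering decision: small or low-quality masks never get their own summary, but their human-written captions still inform the parent's summary.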
Masks marked as low quality are a signal about a failure of the SAM model, rather than something we expect a model performing SCM to handle.
One possible extension of our dataset (moving away from sDCI) would be to have models that can process <full image, pixel mask, caption> triples; however, this wasn't particularly useful for evaluating today's models.
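To illustrate what an evaluation over such triples might look like, here is a hedged sketch of an SCM-style matching metric. The `Triple` fields and the `score` callable are hypothetical, not an API from this repository: `score` stands in for any model that rates how well a caption describes a masked region of the full image.

```python
# Sketch of evaluating a model on <full image, pixel mask, caption>
# triples with an SCM-style matching task. All names are assumptions.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Triple:
    image: str    # path or id of the full image (not a crop)
    mask: str     # pixel mask, e.g. run-length encoded
    caption: str

def scm_accuracy(
    triples: Sequence[Triple],
    score: Callable[[str, str, str], float],
) -> float:
    """For each masked region, the model must pick its caption out of
    all candidate captions; returns top-1 accuracy."""
    captions = [t.caption for t in triples]
    correct = 0
    for t in triples:
        scores = [score(t.image, t.mask, c) for c in captions]
        if captions[scores.index(max(scores))] == t.caption:
            correct += 1
    return correct / len(triples)
```

The key difference from crop-based SCM is that the model sees the full image plus a mask, so small regions keep their surrounding context.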
Thank you for your reply! I totally agree with you, and I believe improving region recognition, especially on small objects, is a good research topic for the future. Do you have plans to officially make summaries for all masks available? By the way, I believe the recent work TAP could perform quite well in your evaluation.
We won't produce summaries for the smaller masks, but you can use the underlying dataset to use the human annotated text label instead.
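In practice, falling back to the human-annotated label is a one-liner. The field names `summary` and `label` below are assumptions about the per-mask record layout, not the dataset's confirmed schema.

```python
# Illustrative sketch: prefer the generated summary when it exists,
# otherwise fall back to the raw human-annotated text label.
# The keys "summary" and "label" are assumed field names.
def text_for_mask(mask: dict) -> str:
    """Return the best available text for a mask record."""
    return mask.get("summary") or mask["label"]
```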
TAP seems pretty relevant; it would definitely make sense to test it in the triples setting I outlined above.
Thank you for your kind reply! :)
Hi, I checked the dataset. Each image has around 30-40 masks with a long caption; however, only masks with a bounding box larger than [224, 224] and a good-quality tag have a summary, and only these masks are tested in your evaluation. Could you please explain why this is the case? I believe it would be closer to "Dense", as claimed in the paper, to test the proposed SCM on matching all possible masks of an image with their captions.