RoelofKuipers opened this issue 1 year ago
Same problem for me. TEXT_PROMPT = "tree . person . grass"
TEXT_PROMPT = "bear."
Any updates?
I have the same problem.
I think Grounding DINO does not learn the semantics of the information; it only learns a matching ability. I have tested two cases.
The common zebra image:
The unusual image:
Pay attention to the confidence score for the detected bbox.
If so, how do they achieve such high scores in the COCO dataset?
I think it is because the model is given the right text prompt: it matches the detected bbox to the text prompt. If the model is given the wrong text prompt, it may still generate a similar bbox output. It seems to behave like a bounding-box generator (like an RPN). You could run some experiments to check whether this is true. I will run some tests myself; if I get results, I will let you know.
I have run some tests. Grounding DINO seems to do similarity matching between the text prompts and the detected bboxes, and it is easy to fool. Taking AP50 as an example, there will be many false positives.
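To make the AP50 point concrete, here is a minimal, self-contained sketch of why prompt-matched boxes hurt the metric: under AP50, any detection with IoU < 0.5 against a ground-truth box of the prompted class counts as a false positive, so boxes "hallucinated" for an absent class all count against precision. The boxes and scores below are made-up toy numbers, not real model output.

```python
# Toy illustration: AP50-style matching when the prompted class is absent.
# All boxes/scores are invented for illustration only.

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Ground truth: the image contains no instance of the prompted class,
# so every detection for that class is necessarily a false positive.
gt_boxes = []
detections = [  # (box, confidence) produced for the wrong prompt
    ((10, 10, 100, 100), 0.42),
    ((50, 30, 180, 160), 0.35),
]

tp = fp = 0
for box, score in detections:
    if any(iou(box, g) >= 0.5 for g in gt_boxes):
        tp += 1
    else:
        fp += 1

precision = tp / (tp + fp) if (tp + fp) else 1.0
print(f"TP={tp}, FP={fp}, precision={precision:.2f}")  # TP=0, FP=2, precision=0.00
```

This is why confident detections on negative prompts, as in the test images above, translate directly into a lower AP50.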
Some testing images:
Human case:
Pay attention to the confidence score comparison.
In my opinion, the false positives are caused by the model structure, specifically the Feature Enhancer. The query selection process uses the 'enhanced' image and text features, which have already undergone cross-modal fusion. If the image and text information match, everything is fine (during training, all pairs are positive). But if we give a negative text prompt, the Feature Enhancer will still fuse the two unrelated features, and query selection is then performed on these 'enhanced' features, so the resulting confidences are not reliable. To prevent these false positives, a pure query selection based on the image and text encoders' outputs (in other words, a contrastive loss) is needed, like CLIP's.
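The idea above can be sketched as a CLIP-style gate applied *before* any fusion: compare raw encoder embeddings with a plain cosine similarity and only run the fused detector when the prompt is plausibly related to the image. This is a hedged illustration with random stand-in embeddings, not Grounding DINO's actual encoders; the threshold value is an assumption that would need tuning.

```python
# Sketch of a pre-fusion contrastive gate. The embeddings are random
# stand-ins; in a real system they would come from separate image and
# text encoders (CLIP-style), *before* the Feature Enhancer fuses them.
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v)

# Stand-in global embeddings.
image_emb = normalize(rng.standard_normal(256))
# A "matching" prompt embedding: close to the image embedding plus small noise.
matching_text_emb = normalize(image_emb + 0.05 * rng.standard_normal(256))
# An unrelated (negative) prompt embedding: independent random direction.
unrelated_text_emb = normalize(rng.standard_normal(256))

def should_detect(img, txt, threshold=0.3):
    """Run the (fused) detector only if raw cosine similarity clears a threshold."""
    return float(img @ txt) >= threshold

print(should_detect(image_emb, matching_text_emb))   # related prompt -> True
print(should_detect(image_emb, unrelated_text_emb))  # negative prompt -> False
```

Because the similarity is computed on unfused features, an unrelated prompt scores near zero and gets rejected, instead of being "enhanced" into a confident false positive.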
Any updates? Did anyone find a solution for the false positives?
Grounding DINO labels true positives very well, but if none of the prompted classes is present in the image, I always end up with false positives, often with high confidence. What can be the reason for this, and is there anything I can do about it? Or is it always necessary to know that one of the classes is present in the image? That seems counter-intuitive.
See the two examples below, where the prompted classes were ['pipe','excavator'].
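One pragmatic workaround (not a fix for the underlying fusion issue discussed above) is a post-filter: tune a per-class confidence threshold on images known *not* to contain that class, and discard detections below it. The class names match this example (`pipe`, `excavator`), but the threshold values, scores, and boxes below are invented for illustration.

```python
# Hedged post-filter sketch: drop detections below per-class thresholds
# that were (hypothetically) tuned on negative images. Numbers are made up.

PER_CLASS_THRESHOLD = {"pipe": 0.45, "excavator": 0.50}

def filter_detections(detections, thresholds, default=0.5):
    """detections: list of (label, score, box); keep only confident matches."""
    return [d for d in detections if d[1] >= thresholds.get(d[0], default)]

raw = [
    ("pipe", 0.38, (12, 40, 220, 90)),        # below its threshold -> dropped
    ("excavator", 0.62, (100, 50, 400, 300)), # clears its threshold -> kept
]
print(filter_detections(raw, PER_CLASS_THRESHOLD))
```

This does not eliminate high-confidence false positives, but raising the thresholds per class is often enough to suppress the low-to-mid confidence hallucinations on empty scenes.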