IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

Many false positives #84

Open · RoelofKuipers opened this issue 1 year ago

RoelofKuipers commented 1 year ago

Grounding DINO can label true positives very well, but if none of the prompted classes is present in the image, I always end up with false positives, often with high confidence. What can be the reason for this, and is there anything I can do about it? Or is it always necessary to know in advance that one of the classes is present in the image? That seems counter-intuitive.

See the two examples below, where the prompted classes were ['pipe','excavator'].

[two example images]
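For reference, this is roughly the setup (a minimal sketch using the inference API from this repo's README; the paths and thresholds below are placeholders, not my exact settings). Raising box_threshold and text_threshold helps a little, but the high-confidence false positives still get through:

```python
# Minimal repro sketch using the inference API from this repo's README.
# The image path and thresholds are placeholders.
from groundingdino.util.inference import load_model, load_image, predict

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
)
image_source, image = load_image("image_without_pipe_or_excavator.jpg")

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="pipe . excavator .",
    box_threshold=0.35,   # raising these suppresses some false positives,
    text_threshold=0.25,  # but the high-confidence ones still slip through
)
print(phrases, logits)  # non-empty even though neither class is present
```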

aixiaodewugege commented 1 year ago

Same problem for me.

[image] TEXT_PROMPT = "tree . person . grass"

[image] TEXT_PROMPT = "bear."

RoelofKuipers commented 1 year ago

Any updates?

ozymand1a commented 1 year ago

I have the same problem.

zhengziqiang commented 1 year ago

I think Grounding DINO does not actually learn semantics; it has only learned a matching ability between text and boxes. I have tested two cases.

The common zebra image: [prediction images]

The unusual image: [images: elephant, fish]

Pay attention to the confidence scores for the detected bboxes.

aixiaodewugege commented 1 year ago

If so, how does it achieve such high scores on the COCO dataset?

zhengziqiang commented 1 year ago

> If so, how does it achieve such high scores on the COCO dataset?

I think it is because the model is given the right text prompt: it matches the detected bboxes against the prompt. If the model is given the wrong text prompt, it may still generate similar bbox outputs. It feels like a bounding box generator (like an RPN). You could run some experiments to check whether this is true. I will run some tests and let you know if there are results.
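A quick way to run that experiment (a minimal sketch using the inference utilities from this repo's README; the image path, prompts, and thresholds are placeholders):

```python
# Sketch of the matched- vs. mismatched-prompt experiment, using the
# inference helpers documented in this repo's README. Paths, prompts,
# and thresholds are illustrative assumptions.
from groundingdino.util.inference import load_model, load_image, predict

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
)
image_source, image = load_image("zebra.jpg")  # image that actually contains zebras

for caption in ["zebra .", "elephant .", "fish ."]:  # one matched, two mismatched
    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=caption,
        box_threshold=0.35,
        text_threshold=0.25,
    )
    # If the model behaves like a class-agnostic box generator, the boxes
    # (and even the scores) will look similar across all three prompts.
    print(caption, boxes.shape[0], [round(l.item(), 3) for l in logits])
```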

zhengziqiang commented 1 year ago

I have run some tests. Grounding DINO seems to do similarity matching between the text prompts and the detected bboxes, and it is easy to fool. Take AP50 as an example: there will be many false positives.

Some test images: [images: zebra_real, monkey, zebra_new]

Human case: [images: person, diver, human, motor driver]

Pay attention to the confidence score comparison.

quanzzz123 commented 1 year ago

In my opinion, the false positives are caused by the model structure, specifically the Feature Enhancer. The query selection process uses the 'enhanced' image and text features, which have already been fused cross-modally. If the image and text are matched, everything is fine (during training, all pairs are positive). But if we give a negative text, the feature enhancer will still fuse those two unrelated features, and query selection is done on these 'enhanced' features, so the result is unreliable. To prevent these false positives, a pure query selection based on the image and text encoders' outputs (in other words, a contrastive loss) is needed, like CLIP's.
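One workaround in that spirit (a hedged sketch, not part of this repo: it uses OpenAI's `clip` package, and the prompt template, class list, and similarity threshold are assumptions) is to score image-text similarity with CLIP first, and only include classes that CLIP considers plausible in the Grounding DINO prompt:

```python
# Sketch: CLIP-based pre-filter to reject prompts whose class is likely
# absent, before running Grounding DINO. The threshold, prompt template,
# and class names are illustrative, not tuned values.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def plausible_classes(image_path, class_names, sim_threshold=0.22):
    """Return the classes whose CLIP image-text similarity clears a threshold."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        image_feat = clip_model.encode_image(image)
        text_feat = clip_model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)  # cosine similarities
    return [c for c, s in zip(class_names, sims.tolist()) if s > sim_threshold]

# Only prompt Grounding DINO with classes that pass the CLIP check:
classes = plausible_classes("site.jpg", ["pipe", "excavator"])
caption = (" . ".join(classes) + " .") if classes else None  # skip detection if None
```

This only filters at the whole-image level, so it won't fix every case, but it cheaply rejects prompts for classes that are clearly not in the scene.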

usama-axcelerate commented 1 year ago

Any update? Did anyone find a solution for the false positives?