IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0
6.89k stars, 697 forks

Incorrect understanding of the text #240

Open RainyLayx opened 1 year ago

RainyLayx commented 1 year ago

When I input the text prompt 'cat with glasses.', I want to get the cat that is wearing glasses, but the model boxes the cat and the glasses as two separate objects. How can I deal with this?...

rentainhe commented 1 year ago

> When I input the text prompt 'cat with glasses.', I want to get the cat that is wearing glasses, but the model boxes the cat and the glasses as two separate objects. How can I deal with this?...

You can try the demo here and set a specific --token_span for "cat with glasses" to see whether it gives you a better result.
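As a rough illustration of what such a span specification might look like, here is a minimal sketch. It assumes that spans are given as [start, end) character offsets into the prompt, one offset pair per word, with the pairs for one phrase grouped into a list. That format is an assumption about the demo script and should be checked against the repository before use. The helper name `phrase_to_spans` is hypothetical and not part of GroundingDINO.

```python
import json
from typing import List


def phrase_to_spans(caption: str, phrase: str) -> List[List[int]]:
    """Return one [start, end) character span per word of `phrase` inside `caption`.

    Hypothetical helper: it only does string arithmetic and does not call
    any GroundingDINO code.
    """
    start = caption.find(phrase)
    if start < 0:
        raise ValueError(f"phrase {phrase!r} not found in caption {caption!r}")

    spans = []
    cursor = start
    for word in phrase.split():
        # Locate each word at or after the end of the previous one.
        word_start = caption.index(word, cursor)
        word_end = word_start + len(word)
        spans.append([word_start, word_end])
        cursor = word_end
    return spans


caption = "cat with glasses ."
# One phrase, made of three words, so that the whole phrase is treated as a single target.
token_spans = [phrase_to_spans(caption, "cat with glasses")]
print(json.dumps(token_spans))  # prints [[[0, 3], [4, 8], [9, 16]]]
```

The printed string is the kind of value that would be passed on the command line, under the format assumption stated above. Grouping all three words into one phrase is what asks for a single box for the whole expression rather than one box per noun.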

RainyLayx commented 1 year ago

I tried this, but nothing was detected...

rentainhe commented 1 year ago

> I tried this, but nothing was detected...

Have you tried lowering the threshold to see the results?

RainyLayx commented 1 year ago

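I tried both 0.9 and 0.1, and in both cases no objects were detected at all. I am wondering whether switching to a text backbone with stronger language understanding would solve this.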

sunwoo76 commented 4 months ago

> When I input the text prompt 'cat with glasses.', I want to get the cat that is wearing glasses, but the model boxes the cat and the glasses as two separate objects. How can I deal with this?...
>
> You can try the demo here and set a specific --token_span for "cat with glasses" to see whether it gives you a better result.

The model uses the hidden-state feature of each text token. This differs from CLIP, which uses the hidden-state feature of the EOS token.
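The difference can be illustrated with a toy sketch. All vectors below are made-up two-dimensional values chosen for illustration, not real model outputs, and the function names are hypothetical. The sketch only shows why per-token matching tends to pair a region with an individual word such as "cat" or "glasses", while a single pooled sentence vector scores the phrase as a whole.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def per_token_scores(region: Vector, token_states: List[Vector]) -> List[float]:
    """Similarity of one region feature to every text token (per-token matching)."""
    return [dot(region, state) for state in token_states]


def pooled_score(region: Vector, token_states: List[Vector]) -> float:
    """Similarity of one region feature to the last (EOS) token only (CLIP-style pooling)."""
    return dot(region, token_states[-1])


tokens = ["cat", "with", "glasses", "<eos>"]
# Made-up hidden states: "cat" and "glasses" point in different directions,
# and the EOS state is a mix of both.
token_states = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

cat_region = [1.0, 0.0]      # toy feature for a box around the cat
glasses_region = [0.0, 1.0]  # toy feature for a box around the glasses

for name, region in [("cat box", cat_region), ("glasses box", glasses_region)]:
    scores = dict(zip(tokens, per_token_scores(region, token_states)))
    print(name, "per-token:", scores, "pooled:", pooled_score(region, token_states))
```

In this toy setup each box scores highest on its own noun token, which mirrors the behaviour reported in this thread: the cat and the glasses are matched as two separate objects rather than as one "cat with glasses" object.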