IDEA-Research / Grounded-SAM-2

Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2
https://arxiv.org/abs/2401.14159
Apache License 2.0

Output class names return ungiven classes #50

Open CYBruce opened 5 days ago

CYBruce commented 5 days ago

Following the clearly written README, I got the model running successfully. However, in my case I ran into some problems. I used grounded_sam2_local_demo.py with the prompt "car . bike . people . parking sign . parking entrance sign ." but the returned JSON file contains classes that were not in the prompt, such as "sign entrance sign" and "entrance sign" (these look like combinations of prompt words). Sometimes an empty class name "" is output as well.


I want to ask whether the model should label only the classes given in the prompt. If so, where do results like "sign entrance sign" come from? Is this problem related to the BOX_THRESHOLD and TEXT_THRESHOLD parameters in the code?

rentainhe commented 2 days ago

@CYBruce Sorry for the late reply. For each box, Grounding DINO combines the text tokens whose confidence score is larger than the text threshold, so the model can return combined words. To avoid this, you can modify the local code to specify the phrases, following the demo here: https://github.com/IDEA-Research/GroundingDINO?tab=readme-ov-file#arrow_forward-demo
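
A minimal sketch of the behaviour described above, in plain Python with no Grounding DINO calls. `phrase_for_box` is a toy model of the per-box thresholding: it joins whichever prompt tokens score above the text threshold, which is how a label like "sign entrance sign" (or an empty label, when no token clears the threshold) could arise. `class_token_spans` computes character spans for each class in the prompt, as a starting point for specifying phrases. Both function names, the whitespace tokenization, and the scores are hypothetical and for illustration only; the `[start, end)` span layout is an assumption about what the linked demo expects and should be checked against that README.

```python
import re

def phrase_for_box(tokens, scores, text_threshold):
    """Toy model: join the prompt tokens whose score for one box exceeds the threshold."""
    return " ".join(t for t, s in zip(tokens, scores) if s > text_threshold)

def class_token_spans(caption):
    """Map each '.'-separated class to [start, end) character spans, one per word."""
    spans = {}
    for chunk in re.finditer(r"[^.]+", caption):
        words = [[chunk.start() + w.start(), chunk.start() + w.end()]
                 for w in re.finditer(r"\S+", chunk.group())]
        if words:
            spans[chunk.group().strip()] = words
    return spans

caption = "car . bike . people . parking sign . parking entrance sign ."
tokens = caption.split()  # whitespace split; the real model uses a subword tokenizer
# Made-up per-token scores for a single box, for illustration only.
scores = [0.02, 0.01, 0.03, 0.01, 0.02, 0.01, 0.10, 0.30, 0.05, 0.20, 0.40, 0.45, 0.05]

print(phrase_for_box(tokens, scores, text_threshold=0.25))        # sign entrance sign
print(repr(phrase_for_box(tokens, scores, text_threshold=0.50)))  # ''
print(class_token_spans(caption)["parking entrance sign"])        # [[37, 44], [45, 53], [54, 58]]
```

In this toy example the first "sign" and the last two words of "parking entrance sign" clear the threshold together, so the joined label mixes tokens from two different classes. Grouping tokens by the spans of each class, instead of joining every token above the threshold, is one way to keep labels restricted to the prompt classes.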