Hello,
I've been experimenting with the GroundingDINO model for a multi-modal (vision + text) zero-shot object detection task, and I've run into a specific issue that I hope to get some guidance on.
The objective of my project is to distinguish between "fallen persons" and individuals who are not in a fallen state. However, when I pass the text prompt "fallen person" to the model, it seems to weight the term "person" heavily and consistently labels individuals as "person", overlooking the "fallen" aspect, which is crucial for my use case.
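For reference, here is roughly how I am invoking the model, a minimal sketch using the standard inference helpers shipped with the repository (the config, checkpoint, and image paths below are just placeholders for my local setup):

```python
# Minimal sketch of my setup, using GroundingDINO's standard inference helpers.
# Config / checkpoint / image paths are placeholders for my local files.
from groundingdino.util.inference import load_model, load_image, predict

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",   # placeholder config path
    "weights/groundingdino_swint_ogc.pth",                # placeholder checkpoint path
)

image_source, image = load_image("frames/example.jpg")   # placeholder image

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="fallen person .",
    box_threshold=0.35,
    text_threshold=0.25,
)

# `phrases` almost always comes back as just "person" rather than "fallen person",
# even for individuals who are clearly lying on the ground.
print(list(zip(phrases, logits.tolist())))
```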
I understand that GroundingDINO is designed for zero-shot object detection, leveraging both visual and textual cues for identification. Given this, I expected that the model would be able to differentiate between these two contexts ("fallen person" vs. "person") based on the textual input provided.
Could you please advise on any adjustments or methodologies that might improve the model's ability to distinguish between these two scenarios? Are there specific parameters or fine-tuning techniques that would help the model better understand and act on the nuanced context provided in the text prompt?
Any suggestions or guidance on this matter would be greatly appreciated, as this distinction is vital for the objectives of my project.
Thank you very much for your time and assistance.
Best regards,