Problem about the score threshold

AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection

https://www.yoloworld.cc

GNU General Public License v3.0

Problem about the score threshold #237

Open chch9907 opened 4 months ago

chch9907 commented 4 months ago

Very nice work! I intend to use it in my project, but one thing confuses me. The score threshold in both https://github.com/AILab-CVC/YOLO-World/blob/5ee2e01d52fd02aa75f731b145244ed204781d0f/demo.py#L182 and https://github.com/AILab-CVC/YOLO-World/blob/5ee2e01d52fd02aa75f731b145244ed204781d0f/inference.ipynb#L3444 is set to 0.05, which seems very low compared to the values used in Detic and GroundingDINO. When I test the online demo with the text prompts shown below, the targets are detected successfully but with very low scores. The inference speed is clearly extremely fast and the text grounding capability is strong, but I wonder how the default score threshold (0.05) was chosen, since that may affect the right way to use the model. Thanks!

Query prompts: "pillar, snowman"

[Image: test_yolow] Result of the YOLO-World demo with the score threshold changed to 0.03.

[Image: test_groundingdino] Result of the GroundingDINO demo.
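For readers unfamiliar with what the threshold does, here is a minimal sketch of score filtering. The detections and their scores are made up for illustration and are not YOLO-World outputs. The helper name `filter_by_score` is hypothetical, and the strict `>` comparison is an assumption about how the demo applies the threshold.

```python
from typing import List, Tuple

# Hypothetical (label, score) pairs. The values are invented for illustration.
Detection = Tuple[str, float]


def filter_by_score(detections: List[Detection], score_thr: float) -> List[Detection]:
    """Keep only detections whose confidence is above score_thr."""
    return [d for d in detections if d[1] > score_thr]


detections = [("pillar", 0.04), ("snowman", 0.31), ("pillar", 0.12)]

# A low threshold keeps low-confidence boxes; a higher one drops them.
for thr in (0.03, 0.05, 0.25):
    kept = filter_by_score(detections, thr)
    print(f"score_thr={thr}: {kept}")
```

With scores this low, a threshold typical of other detectors would discard boxes that are otherwise correct, which is the behavior described above.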

wondervictor commented 4 months ago

Maybe you could try adding a padding class " ", such as ["pillar", "snowman", " "]. The low-confidence problem is due to the padding during pre-training and hasn't been well fixed yet.
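A minimal sketch of the padding-class suggestion, assuming the class names are passed as a list of one-item lists; check the repository's demo script for the format it actually expects. The helper name `build_texts` is hypothetical.

```python
from typing import List


def build_texts(prompt: str, pad: str = " ") -> List[List[str]]:
    """Split a comma-separated prompt and append a padding class at the end."""
    names = [t.strip() for t in prompt.split(",")]
    names = [n for n in names if n]  # drop empty entries, e.g. from a trailing comma
    # Assumption: each class name is wrapped in its own one-item list.
    return [[n] for n in names] + [[pad]]


texts = build_texts("pillar, snowman")
print(texts)  # [['pillar'], ['snowman'], [' ']]
```

Appending the padding entry in code avoids relying on how a demo text box parses a blank item between commas.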

chch9907 commented 4 months ago

Thank you for your reply! I tried your advice in the demo, as shown below, and also tried adding "" to each category, but there seems to be no change.

[Image: test_yolow2]

Indeed, the detected bounding boxes are comparable to those of GroundingDINO, but the scores are sometimes low. Anyway, I will try the demo again and then try your code. Thank you!

wufei-png commented 4 months ago

@chch9907 Maybe try "snowman,pillar, ,"?

wondervictor commented 4 months ago

Hi @chch9907, could you share this sample image?

chch9907 commented 4 months ago

@wufei-png Thank you for your reply. I tried "snowman,pillar, ,", but there is still no difference in the demo.

chch9907 commented 4 months ago

@wondervictor Hmm, maybe I will send the sample image to you by email.

wondervictor commented 4 months ago

@chch9907 tianhengcheng#gmail.com

chch9907 commented 4 months ago

@wondervictor It has been sent to you.

wondervictor commented 4 months ago

Hi @chch9907, I used the X-1280 model with a threshold of 0.25 and obtained the results. Maybe you can try out the fine-tuned high-resolution models. We are working on the low-confidence problem, though YOLO-World can detect those objects correctly.

chch9907 commented 4 months ago

Hi @wondervictor, thank you for your reply! X-1280 seems to work well. Yes I understand. I will try high-resolution models first. Thanks!