AILab-CVC / YOLO-World

[CVPR 2024] Real-Time Open-Vocabulary Object Detection
https://www.yoloworld.cc
GNU General Public License v3.0

Problem about the score threshold #237

Open chch9907 opened 4 months ago

chch9907 commented 4 months ago

Very nice work! I intend to use it in my project, but one thing confuses me. The score threshold in both https://github.com/AILab-CVC/YOLO-World/blob/5ee2e01d52fd02aa75f731b145244ed204781d0f/demo.py#L182 and https://github.com/AILab-CVC/YOLO-World/blob/5ee2e01d52fd02aa75f731b145244ed204781d0f/inference.ipynb#L3444 is set to 0.05, which seems very low compared with the values used in Detic and GroundingDINO. When I test the online demo with the text prompt shown below, the target is detected successfully but with a very low score. The inference speed is clearly very fast and the text grounding capability is strong, but I wonder how the default score threshold (0.05) was chosen, since it may affect the right way to use the model. Thanks!

query prompts: "pillar, snowman"
[image: test_yolow] result of the YOLO-World demo with the score threshold changed to 0.03
[image: test_groundingdino] result of the GroundingDINO demo

wondervictor commented 4 months ago

Maybe you could try adding a padding class " ", for example ["pillar", "snowman", " "]. The low-confidence problem is caused by the padding used during pre-training and hasn't been fully fixed yet.
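
A minimal sketch of the suggestion above, assuming the prompts are entered as one comma-separated string. `build_class_list` is a hypothetical helper written for illustration and is not part of the YOLO-World code; how the resulting list is passed to the model depends on the demo script being used.

```python
def build_class_list(prompt: str, padding: str = " ") -> list[str]:
    """Split a comma-separated prompt and append one padding class."""
    # Drop empty entries so stray commas do not create extra blank classes.
    names = [name.strip() for name in prompt.split(",") if name.strip()]
    # Append the padding class once, at the end of the list.
    return names + [padding]


print(build_class_list("pillar, snowman"))  # ['pillar', 'snowman', ' ']
```

Stripping and re-appending the padding keeps the list stable regardless of how the commas and spaces were typed, so "snowman,pillar, ," yields the same kind of list with a single trailing " ".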

chch9907 commented 4 months ago

Thank you for your reply! I tried your advice on the demo, as shown below, and also tried adding "" to each category, but nothing seems to change.
[image: test_yolow2]

Indeed, the detected bounding boxes are comparable to those of GroundingDINO, but sometimes the scores are low. Anyway, I will try the demo again and then try your code. Thank you!

wufei-png commented 4 months ago

maybe "snowman,pillar, ,"?

wondervictor commented 4 months ago

Hi @chch9907, could you share this sample image?

chch9907 commented 4 months ago

Thank you for your reply. There is still no difference in the demo.

chch9907 commented 4 months ago

Hmm, maybe I will send it to you by email.

wondervictor commented 4 months ago

@chch9907 tianhengcheng#gmail.com

chch9907 commented 4 months ago

@wondervictor, it has been sent to you.

wondervictor commented 4 months ago
[image: Screenshot 2024-04-11 17 17 40]

Hi @chch9907, I used the X-1280 model with a threshold of 0.25 and obtained the results shown above. Maybe you can try the fine-tuned high-resolution models. We are working on the low-confidence problem, although YOLO-World can detect those objects accurately.
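
To illustrate what the score threshold does, here is a rough sketch of post-filtering detections at the two values mentioned in this thread (0.05 and 0.25). The `Detection` type, the `filter_by_score` helper, and the scores are all made up for illustration; they are not YOLO-World's API or outputs from the image discussed here.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    label: str
    score: float


def filter_by_score(detections, score_thr):
    """Keep only detections whose score is strictly above score_thr."""
    return [d for d in detections if d.score > score_thr]


# Made-up scores for illustration only; they are not results from this thread.
detections = [
    Detection("snowman", 0.31),
    Detection("pillar", 0.08),
    Detection("pillar", 0.02),
]

for thr in (0.05, 0.25):
    kept = filter_by_score(detections, thr)
    print(thr, [d.label for d in kept])
```

With a model that produces low scores for correct boxes, a threshold of 0.25 would discard them, which is presumably why a low default such as 0.05 keeps more of the correct detections at the cost of more false positives.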

chch9907 commented 4 months ago

Hi @wondervictor, thank you for your reply! X-1280 seems to work well. Yes I understand. I will try high-resolution models first. Thanks!