IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

Code to reproduce evaluation results on LVIS Minival #147

Closed · rohit901 closed this issue 6 months ago

rohit901 commented 1 year ago

Hello,

Could you please share the code to reproduce the results on LVIS minival?

Thanks,

SlongLiu commented 1 year ago

Thanks for your message. I've provided code for zero-shot COCO evaluation, which you can refer to.
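
For reference, the evaluation can be wired into pycocotools roughly like this (a minimal sketch; the file paths are assumptions, and predictions are assumed to have been saved as standard COCO result dicts):

```python
import json

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Paths are assumptions. Each prediction is a standard COCO result dict:
# {"image_id": int, "category_id": int, "bbox": [x, y, w, h], "score": float}
coco_gt = COCO("annotations/instances_val2017.json")
with open("gdino_coco_results.json") as f:
    coco_dt = coco_gt.loadRes(json.load(f))

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # AP is computed from score-ranked detections
```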

rohit901 commented 1 year ago

Thank you for adding the code. I'll go through it and report back if I face any issues.

rohit901 commented 1 year ago

Hi @SlongLiu, I have a doubt after going through the evaluation code. You are not using any "BOX_THRESHOLD" or "TEXT_THRESHOLD" to filter out low-scoring boxes here; you only select the top 300 scoring boxes for evaluation.

My question is: won't this leave a lot of noisy, low-scoring boxes if we don't filter by thresholds? Visualizing or plotting these predictions would show highly noisy boxes with a lot of false positives.

Could you clarify?

SlongLiu commented 1 year ago


Thanks for your questions.

The evaluator assigns positive and negative samples based on the order of scores, so extra low-scoring predictions won't harm the final performance: a low-confidence false positive is ranked last and only affects the tail of the precision-recall curve. Refer to the definition of mAP in COCO for more details.
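
Concretely, evaluation just ranks by score and keeps the top k boxes per image, with no threshold; a simplified sketch (the tensor shapes are assumptions based on the model outputs, not the repo's exact code):

```python
import torch

def select_topk(pred_logits, pred_boxes, k=300):
    """Keep the k highest-scoring boxes per image, with no score threshold."""
    # pred_logits: (num_queries, num_text_tokens) raw logits
    # pred_boxes:  (num_queries, 4) boxes in normalized cxcywh format
    scores = pred_logits.sigmoid().max(dim=-1).values  # best token per query
    topk_scores, idx = scores.topk(min(k, scores.numel()))
    return topk_scores, pred_boxes[idx]
```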

We set "BOX_THRESHOLD" and "TEXT_THRESHOLD" for applications only, since no downstream use can reasonably handle 300 boxes per image.
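
In an application, the thresholds kick in instead; something like this sketch (the threshold values are the usual demo defaults, and the internals may differ from the repo's actual code):

```python
import torch

def filter_for_demo(pred_logits, pred_boxes,
                    box_threshold=0.35, text_threshold=0.25):
    """Sketch of application-time filtering by BOX/TEXT thresholds."""
    scores = pred_logits.sigmoid()                    # (num_queries, num_tokens)
    keep = scores.max(dim=-1).values > box_threshold  # drop low-confidence queries
    boxes, token_scores = pred_boxes[keep], scores[keep]
    # text_threshold decides which prompt tokens each kept box is matched
    # to when recovering the predicted phrase for display.
    phrase_mask = token_scores > text_threshold       # (num_kept, num_tokens)
    return boxes, token_scores.max(dim=-1).values, phrase_mask
```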

rohit901 commented 1 year ago

Thank you so much @SlongLiu for your response and the explanation. I had thought we should first filter out noisy, low-quality boxes with thresholds and NMS before evaluating.

So I'm splitting the 1203 LVIS classes into subsets of 81 classes at a time, passing each subset as the text prompt, and finally concatenating the results from all the subsets; a sketch of the procedure follows below. I hope there are no bugs in my code [I'm using the Detectron2 library for the LVIS evaluation]. I'm getting an AP of 20.272, which is much better than my previous result of ~1 AP with high BOX/TEXT thresholds and NMS. What do you think explains the remaining gap from the results in the paper? Could there be a bug in my code, or is it because of the way I split the classes?
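
For reference, here is roughly what my chunked evaluation looks like (a simplified sketch; `run_model` is a placeholder helper, not actual repo code):

```python
CHUNK = 81  # number of LVIS class names passed per forward pass

def detect_all_classes(image, lvis_classes, top_k=300):
    """Run the detector once per class chunk and merge the detections."""
    detections = []
    for start in range(0, len(lvis_classes), CHUNK):
        chunk = lvis_classes[start:start + CHUNK]
        caption = " . ".join(chunk) + " ."  # GroundingDINO-style text prompt
        # run_model is a hypothetical helper returning boxes, scores, and
        # chunk-local class indices for one image/caption pair.
        boxes, scores, labels = run_model(image, caption)
        # Map chunk-local label indices back to global LVIS category ids.
        detections += [(b, s, start + l) for b, s, l in zip(boxes, scores, labels)]
    # Merge across chunks and keep only the top_k highest-scoring boxes.
    detections.sort(key=lambda d: d[1], reverse=True)
    return detections[:top_k]
```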

Also, for more details on this evaluation procedure, should I read the MS-COCO paper? Could you give me a reference/link I can read to understand it?

Thanks,

Mukil07 commented 6 months ago

@rohit901 Were you able to evaluate on the LVIS dataset?

rohit901 commented 6 months ago

@Mukil07 Yes, please check out my recent project on GitHub (and please star the repo): https://github.com/rohit901/cooperative-foundational-models

hnanacc commented 3 months ago

@rohit901 I have noticed that the chunk size (the number of class names passed in one forward pass) affects the results: the larger the chunk, the better the result. I guess this has something to do with the text-to-image cross-attention and query selection. @SlongLiu, can you please confirm this?
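
The effect is easy to check with a sweep like this (a hypothetical sketch; `evaluate_lvis` and `model` stand in for an evaluation pipeline like the one described above):

```python
# Hypothetical sweep: re-run the LVIS evaluation with different chunk sizes
# and compare the resulting AP. evaluate_lvis is a placeholder helper.
for chunk_size in (27, 81, 150, 300):
    ap = evaluate_lvis(model, class_chunk_size=chunk_size)
    print(f"chunk_size={chunk_size}: AP={ap:.3f}")
```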