IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

Some questions about the details in paper #224

Open yangyuya opened 10 months ago

yangyuya commented 10 months ago

To the Authors

Very nice job! I have a few questions after reading the paper:

  1. For object detection tasks you concatenate all category names as the input text. How do you handle the text when training on REC datasets? Do you concatenate all phrases, as in the detection task, or use one phrase at a time?
  2. Is training Grounding DINO the same as training DINO, except for classification? Does it use CDN and Look Forward Twice, as DINO does?
  3. GLIP is trained on detection and grounding datasets, then generates pseudo labels for caption datasets, and is finally trained again on all three kinds of datasets. Does Grounding DINO follow the same procedure? Is the final Grounding DINO trained on all datasets starting from a pretrained DINO, with the relevant parameters frozen?

Thank you and looking forward to your reply.

SlongLiu commented 10 months ago

Thanks for your questions!

  1. We use the MDETR-processed REC data, i.e., a noun phrase is extracted with spaCy.
  2. Yes, I think so. The base detector is exactly DINO.
  3. The final models are trained from scratch (with an ImageNet-pretrained backbone). The experiment that trains from a pretrained DINO was added for ablation only.
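
For readers following question 1, here is a minimal sketch of what "concatenating all category names as the input text" could look like for a detection batch. It is an illustration only, not the authors' code: the function name `build_detection_prompt` and the `" . "` separator are assumptions made for this example, and the character spans are just one possible way to map each category back to its position in the prompt.

```python
from typing import Dict, List, Tuple


def build_detection_prompt(
    category_names: List[str], sep: str = " . "
) -> Tuple[str, Dict[str, Tuple[int, int]]]:
    """Join category names into one prompt and record each name's character span.

    Hypothetical helper: the separator and the span bookkeeping are assumptions
    for illustration, not the repository's implementation.
    """
    parts: List[str] = []
    spans: Dict[str, Tuple[int, int]] = {}
    cursor = 0  # character offset where the next name starts
    for raw_name in category_names:
        name = raw_name.strip().lower()
        start, end = cursor, cursor + len(name)
        spans[name] = (start, end)
        parts.append(name)
        cursor = end + len(sep)  # skip over the separator
    return sep.join(parts), spans


prompt, spans = build_detection_prompt(["person", "dog", "traffic light"])
print(prompt)
for name, (start, end) in spans.items():
    # Each recorded span slices back to exactly its category name.
    assert prompt[start:end] == name
```

For REC data the reply above suggests a different input: rather than a concatenated category list, the text is the referring expression, with a noun phrase extracted by spaCy in the MDETR preprocessing.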