Some questions about the details in paper

IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"

Apache License 2.0

To the Authors

Very nice work! I have a few questions after reading the paper:

1) For object detection tasks, you concatenate all category names into the input text. How do you handle the text when training on REC datasets? Do you concatenate all phrases, as in the detection setting, or feed one phrase at a time?
2) Is the training of Grounding DINO the same as that of DINO, apart from classification? Does it use CDN and Look Forward Twice, both of which are used in DINO?
3) GLIP is first trained on detection and grounding datasets, then generates pseudo labels for caption datasets, and is finally trained again on all three kinds of datasets. Does Grounding DINO follow the same procedure? Is Grounding DINO finally trained on all datasets, starting from a pretrained DINO with the relevant parameters frozen?
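
To make question 1 concrete, here is a minimal sketch of what concatenating category names into a single text prompt might look like. It is an illustration under assumptions, not code from the repository: the function name `build_detection_prompt`, the ` . ` separator, and the character-span bookkeeping are all hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple


def build_detection_prompt(
    category_names: List[str], separator: str = " . "
) -> Tuple[str, Dict[str, Tuple[int, int]]]:
    """Join category names into one prompt and record each name's character span.

    Hypothetical helper: the separator and the span format are assumptions.
    """
    cleaned = [name.strip().lower() for name in category_names]
    spans: Dict[str, Tuple[int, int]] = {}
    cursor = 0
    for name in cleaned:
        # Half-open span [start, end) of this category inside the final prompt.
        spans[name] = (cursor, cursor + len(name))
        cursor += len(name) + len(separator)
    # Terminate the prompt with a trailing " ." after the last category.
    prompt = separator.join(cleaned) + " ."
    return prompt, spans


prompt, spans = build_detection_prompt(["person", "dog", "traffic light"])
print(prompt)
print(spans)
```

The spans let a box prediction matched to some text positions be mapped back to a category name. For an REC sample, the open question is whether the analogous prompt holds a single referring phrase or several phrases joined in the same way.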

Thank you and looking forward to your reply.

Some questions about the details in paper #224