Very nice job! I just have a few questions after reading the paper:
1) You concatenate all category names into one input text for object detection tasks, but how are the texts handled when training on REC datasets: do you concatenate all phrases as in the detection case, or use one phrase at a time?
2) When training Grounding DINO, is the training procedure the same as DINO except for the classification part? Does it use CDN and Look Forward Twice, which are used in DINO?
3) GLIP is first trained on detection and grounding datasets, then generates pseudo labels for caption datasets, and is finally trained on all three kinds of datasets. Does Grounding DINO follow the same procedure? And in the final stage, does Grounding DINO train on all datasets starting from a pretrained DINO, with the relevant parameters frozen?
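To make question 1) concrete, here is a small sketch of my understanding of the two input-text regimes (the function names, the `" . "` separator, and the single-phrase option are my own illustrative assumptions, not taken from the paper):

```python
def detection_prompt(category_names):
    """Detection: all category names concatenated into one input text.
    The separator is an assumption for illustration."""
    return " . ".join(category_names) + " ."

def rec_prompts(phrases, one_at_a_time=True):
    """REC: either one referring phrase per forward pass (a list of
    texts), or all phrases concatenated like the detection case."""
    if one_at_a_time:
        return list(phrases)
    return [" . ".join(phrases) + " ."]

print(detection_prompt(["cat", "dog", "person"]))  # cat . dog . person .
print(rec_prompts(["the man in red", "a dog on the left"]))
```

My question is essentially which of the two `rec_prompts` branches (or some other scheme) is used during REC training.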
Thank you and looking forward to your reply.