IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

What are the labels for the contrastive loss? #225

Open ws1hope opened 1 year ago

ws1hope commented 1 year ago

Hello authors, thank you very much for your excellent work. I am trying to train Grounding DINO and have a question about the contrastive loss described in Section 3.5 of the paper. My understanding is that "predicted_objects" is the output of the decoder and "language tokens" is the "encoded_text" in text_dict, but I haven't figured out what the ground truth is supposed to be. Can you help me with this?

hhaAndroid commented 1 year ago

https://github.com/IDEA-Research/GroundingDINO/issues/228
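
For readers who land here first: the paper says the classification loss follows GLIP, with each query dot-multiplied against the text features to give one logit per text token and a focal loss applied to each logit. Under that reading, one plausible ground truth is a token-level "positive map": for every matched query, a binary vector over the text tokens marking which tokens belong to that box's phrase. The sketch below is only an illustration of that idea, not code from this repository. The function names, the character-span inputs, and the precomputed query-to-box `matches` (standing in for Hungarian matching) are all assumptions, and it uses plain Python lists rather than tensors.

```python
import math

def build_positive_map(token_spans, phrase_spans):
    """One row per ground-truth box: 1.0 where a token overlaps the box's phrase.

    token_spans / phrase_spans are (start, end) character offsets in the caption.
    Both inputs are hypothetical; a real pipeline would get them from a tokenizer.
    """
    rows = []
    for p_start, p_end in phrase_spans:
        row = [1.0 if (t_start < p_end and p_start < t_end) else 0.0
               for t_start, t_end in token_spans]
        rows.append(row)
    return rows

def build_targets(num_queries, matches, positive_map, num_tokens):
    """Per-query targets: matched queries copy their box's row, the rest are all zeros.

    matches maps query index -> ground-truth box index (assumed to come from matching).
    """
    targets = [[0.0] * num_tokens for _ in range(num_queries)]
    for query_idx, box_idx in matches.items():
        targets[query_idx] = list(positive_map[box_idx])
    return targets

def query_token_logits(query_feats, text_feats):
    """Dot product of every query feature with every text token feature."""
    return [[sum(q * t for q, t in zip(qv, tv)) for tv in text_feats]
            for qv in query_feats]

def sigmoid_focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Standard sigmoid focal loss for a single logit and a 0/1 target."""
    p = 1.0 / (1.0 + math.exp(-logit))
    ce = -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    p_t = p * target + (1.0 - p) * (1.0 - target)
    alpha_t = alpha * target + (1.0 - alpha) * (1.0 - target)
    return alpha_t * (1.0 - p_t) ** gamma * ce

# Toy caption "cat . dog ." with four tokens: "cat", ".", "dog", "."
token_spans = [(0, 3), (4, 5), (6, 9), (10, 11)]
phrase_spans = [(0, 3), (6, 9)]          # box 0 is "cat", box 1 is "dog"
positive_map = build_positive_map(token_spans, phrase_spans)

# Three queries; query 0 is matched to box 1 and query 2 to box 0 (made-up matching).
targets = build_targets(3, {0: 1, 2: 0}, positive_map, num_tokens=4)

# Made-up 2-d features, only to exercise the loss.
query_feats = [[0.5, -1.0], [0.0, 0.0], [1.5, 0.2]]
text_feats = [[1.0, 0.0], [0.0, 0.0], [0.0, -1.0], [0.0, 0.0]]
logits = query_token_logits(query_feats, text_feats)

total_loss = sum(sigmoid_focal_loss(l, t)
                 for l_row, t_row in zip(logits, targets)
                 for l, t in zip(l_row, t_row))
```

In this toy setup the unmatched query gets an all-zero target row, so every one of its token logits is pushed down. Padding tokens and any normalization of the loss are left out of the sketch.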