Hello authors, thank you very much for your excellent work.
I am trying to train Grounding DINO, but I have some questions about the contrastive loss mentioned in Section 3.5 of the paper.
My understanding is that "predicted_objects" is the output of the decoder and "language tokens" is the "encoded_text" in text_dict, but I haven't figured out what the ground truth is supposed to be. Could you help me with this?
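
To make the question concrete, here is a minimal sketch of how I picture the logits being computed, assuming the loss compares each decoder query with each text token through a dot product. The function name `similarity_logits`, the shapes, and the random inputs are my own placeholders and are not taken from the repository.

```python
import numpy as np


def similarity_logits(predicted_objects, language_tokens):
    """Dot product between every decoder query and every text token.

    predicted_objects: (num_queries, dim) decoder outputs (assumed shape)
    language_tokens:   (num_tokens, dim) "encoded_text" features (assumed shape)
    returns:           (num_queries, num_tokens) similarity logits
    """
    return predicted_objects @ language_tokens.T


# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
predicted_objects = rng.standard_normal((4, 8))  # 4 queries, feature dim 8
language_tokens = rng.standard_normal((3, 8))    # 3 text tokens, feature dim 8

logits = similarity_logits(predicted_objects, language_tokens)
print(logits.shape)
```

If this picture is right, the ground truth would need to be a target of the same (num_queries, num_tokens) shape, and how that target is built is the part I am unsure about.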