IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

Some questions on the details in paper #1

Open Dwrety opened 1 year ago

Dwrety commented 1 year ago

To the Authors

This is very interesting and solid work on visual grounding with a query-based detector. The paper is also well written and clear. The results with GLIGEN are super interesting as well. I do have a few very specific questions about the implementation and concepts in the paper.

  1. As for the language-guided query selection: this module makes a lot of sense, and you are basically saying that you want to extract the locations of the image tokens that have the greatest responses to the text tokens, and then use these as the location queries in DINO's mixed-query-selection design (a minimal sketch of this selection follows the list). I notice you describe the outer product between text/image tokens as logits. My questions are: (a) Is there any supervision at this level? If not, did you use any pretrained vision-language initialization so that they naturally respond? (b) Would it make more sense to use normalized feature vectors, so that the dot product is actually a correlation? (c) What happens if the selected image tokens all respond to the same text token, or to only a few text tokens, and is there any way to separate them out, like the first-stage training in Deformable DETR or DINO?
  2. As for the sub-sentence level text feature: (a) How is the attention mask produced when dealing with weak annotations such as image-caption pairs (Cap4M)? Did you use a noun-extraction method as described in DetCLIP? A concrete example would be how to generate the attention mask for a concept like "fruit fly" or a human name such as "Harry Potter" when the detection dataset doesn't have that category. (b) And how do you handle the input length limit that GLIP describes in their paper when you have over 1000 categories, as in LVIS, during training/inference? Was there a sparse negative-category sampling strategy?
  3. Loss function: is the negative class handled similarly to the alignment loss described in GLIP or MDETR? I assume you apply sigmoid focal loss and the negative object queries simply learn the 0 from the {0, 1} binary target? (A sketch of that scheme also follows the list.)
  4. Last but not least, do you think it is possible to leverage other frameworks such as pretrained ALBEF, VLMo, or even BeiTv3 and inject your design into them? If not, what do you think the limitations of these frameworks are?
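
To make question 1 concrete, here is a minimal sketch of the language-guided query selection being discussed, written against dummy tensors; the function name, shapes, and query count are illustrative assumptions, not the repository's actual code.

```python
# Minimal sketch of language-guided query selection, written against dummy
# tensors; names and shapes are illustrative, not the repository's code.
import torch

def language_guided_query_select(img_feat, txt_feat, num_queries=900):
    """Select the image tokens that respond most strongly to any text token.

    img_feat: (bs, n_img, d) flattened multi-scale image features
    txt_feat: (bs, n_txt, d) encoded text features
    Returns indices of the selected image tokens, used as the location
    queries in DINO-style mixed query selection.
    """
    # Outer product between image and text tokens -> similarity logits
    logits = torch.einsum("bnd,bmd->bnm", img_feat, txt_feat)   # (bs, n_img, n_txt)
    # Each image token keeps its strongest response over all text tokens
    max_per_img_token = logits.max(dim=-1).values               # (bs, n_img)
    # Top-k image tokens become the decoder's positional queries
    return max_per_img_token.topk(num_queries, dim=-1).indices

# Toy usage
idx = language_guided_query_select(torch.randn(2, 10000, 256), torch.randn(2, 16, 256))
print(idx.shape)  # torch.Size([2, 900])
```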
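
And a companion sketch for question 3: sigmoid focal loss applied to a GLIP-style per-query, per-text-token logit matrix with {0, 1} targets, where unmatched (negative) queries simply carry an all-zero target row. This is a generic illustration of the scheme described in the question, not the repository's exact loss implementation.

```python
# Generic sketch: sigmoid focal loss over {0, 1} token-level targets; not the
# repository's exact loss implementation.
import torch
import torch.nn.functional as F

def token_sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: (num_queries, num_text_tokens); targets are 0/1."""
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    loss = ce * (1 - p_t) ** gamma
    if alpha >= 0:
        loss = (alpha * targets + (1 - alpha) * (1 - targets)) * loss
    return loss.sum()

# Matched queries get a 1 at their phrase's text token; negative queries keep
# an all-zero row and are pushed toward 0 everywhere.
logits = torch.randn(900, 16)
targets = torch.zeros(900, 16)
targets[0, 3] = 1.0   # e.g. query 0 matched to the text token of its phrase
print(token_sigmoid_focal_loss(logits, targets))
```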

Thank you.

SlongLiu commented 1 year ago

Thanks for your questions. We will provide the demo with GLIGEN soon.

  1. (a) Similar to DINO, we calculate a loss after the encoder, which supervises this module. (b) That is a good question. We use the outer product to mimic the linear operation in classification, where no normalization is used. I think it is worth measuring the influence of normalization. (c) I have not dealt with the situation you mention, but I think a well-trained model can respond correctly to the text tokens. If the selected image tokens all respond to the same text token, it may mean that objects for the other texts do not exist in the image. That is still a good question; we will try more corner cases for the model.

SlongLiu commented 1 year ago
  2. (a) We only use masks for detection data that have annotations. For other data, we do not separate phrases in one sentence. (A sketch of such a per-phrase mask follows this list.) (b) We clip the sentence to ensure it is not too long hahahaha. We have not tried any other sampling strategies. It would be great if you would like to work with us to improve the model.
  3. The loss function is similar to GLIP. More specifically, it is a focal loss.
  4. That is a good question as well. Most of these models are for representation learning; we can simply add a Grounding DINO decoder to them for open-set detection.
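
To illustrate the sub-sentence level masking mentioned in answer 2(a), here is a sketch of a block-diagonal text attention mask built from hand-supplied phrase boundaries. The helper name and the grouping inputs are assumptions for illustration; the real pipeline derives the groupings from detection annotations, as noted above.

```python
# Illustrative sketch of a block-diagonal "sub-sentence level" text attention
# mask: tokens within the same phrase attend to each other, but not to tokens
# of other phrases, so per-category text features stay independent. Phrase
# boundaries are hand-supplied here; the real pipeline derives them from
# detection annotations.
import torch

def build_subsentence_mask(phrase_ids):
    """phrase_ids: one integer per text token; tokens sharing an id belong to
    the same phrase (e.g. "fruit fly" -> two tokens with the same id).
    Returns an (L, L) boolean mask, True where attention is allowed."""
    ids = torch.tensor(phrase_ids)
    return ids.unsqueeze(0) == ids.unsqueeze(1)

# "fruit fly . harry potter ." -> phrase 0 = fruit fly, phrase 1 = harry potter,
# with each separator token given its own id so it stays isolated.
mask = build_subsentence_mask([0, 0, 2, 1, 1, 3])
print(mask.int())
```
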
Dwrety commented 1 year ago
> 2. (a) We only use masks for detection data that have annotations. For other data, we do not separate phrases in one sentence. (b) We clip the sentence to ensure it is not too long hahahaha. We have not tried any other sampling strategies. It would be great if you would like to work with us to improve the model.
> 3. The loss function is similar to GLIP. More specifically, it is a focal loss.
> 4. That is a good question as well. Most of these models are for representation learning; we can simply add a Grounding DINO decoder to them for open-set detection.

Thanks for your answer; I am still looking through your code. I think this work is amazing.

Dwrety commented 1 year ago

Are the pseudo labels on CC3M and SBU the same as the ones used in GLIP/GLIPv2, or did you generate them yourselves? To my knowledge, Microsoft hasn't yet released their pseudo labels. Will you be releasing them?

RyanHTR commented 1 year ago
> 2. (a) We only use masks for detection data that have annotations. For other data, we do not separate phrases in one sentence. (b) We clip the sentence to ensure it is not too long hahahaha. We have not tried any other sampling strategies. It would be great if you would like to work with us to improve the model.
> 3. The loss function is similar to GLIP. More specifically, it is a focal loss.
> 4. That is a good question as well. Most of these models are for representation learning; we can simply add a Grounding DINO decoder to them for open-set detection.

Nice work! According to your answer to question 2(b), when evaluating the model on the LVIS dataset, do you mean that you concatenate a subset of the category names to stay within the 256-text-token limit, forward each image multiple times with different subsets, and then merge the results as the final detections over the 1000+ LVIS categories? Thanks!

Dwrety commented 1 year ago
> 2. (a) We only use masks for detection data that have annotations. For other data, we do not separate phrases in one sentence. (b) We clip the sentence to ensure it is not too long hahahaha. We have not tried any other sampling strategies. It would be great if you would like to work with us to improve the model.
> 3. The loss function is similar to GLIP. More specifically, it is a focal loss.
> 4. That is a good question as well. Most of these models are for representation learning; we can simply add a Grounding DINO decoder to them for open-set detection.
>
> Nice work! According to your answer to question 2(b), when evaluating the model on the LVIS dataset, do you mean that you concatenate a subset of the category names to stay within the 256-text-token limit, forward each image multiple times with different subsets, and then merge the results as the final detections over the 1000+ LVIS categories? Thanks!

I am not the author, but I think they followed GLIP, which uses the same inference process as you described.

SlongLiu commented 1 year ago

> Are the pseudo labels on CC3M and SBU the same as the ones used in GLIP/GLIPv2, or did you generate them yourselves? To my knowledge, Microsoft hasn't yet released their pseudo labels. Will you be releasing them?

Thanks for your questions. We use the GLIP-annotated data for training.

SlongLiu commented 1 year ago
> 2. (a) We only use masks for detection data that have annotations. For other data, we do not separate phrases in one sentence. (b) We clip the sentence to ensure it is not too long hahahaha. We have not tried any other sampling strategies. It would be great if you would like to work with us to improve the model.
> 3. The loss function is similar to GLIP. More specifically, it is a focal loss.
> 4. That is a good question as well. Most of these models are for representation learning; we can simply add a Grounding DINO decoder to them for open-set detection.
>
> Nice work! According to your answer to question 2(b), when evaluating the model on the LVIS dataset, do you mean that you concatenate a subset of the category names to stay within the 256-text-token limit, forward each image multiple times with different subsets, and then merge the results as the final detections over the 1000+ LVIS categories? Thanks!
>
> I am not the author, but I think they followed GLIP, which uses the same inference process as you described.

Yes, we follow GLIP. We first run the model multiple times with different category names and then merge the outputs.
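
A rough sketch of that chunked evaluation: split the LVIS category list so each prompt stays under the text-token limit, run the model once per chunk, and merge the per-chunk detections. The `model.predict` call, chunk size, and prompt format here are hypothetical placeholders; only the split / run-per-chunk / merge structure reflects the answer above.

```python
# Rough sketch of chunked LVIS evaluation. `model.predict` and chunk_size are
# hypothetical placeholders, not the repository's API.
def chunked_lvis_inference(model, image, category_names, chunk_size=80):
    all_boxes, all_scores, all_labels = [], [], []
    for start in range(0, len(category_names), chunk_size):
        chunk = category_names[start:start + chunk_size]
        caption = " . ".join(chunk) + " ."                         # one text prompt per chunk
        boxes, scores, label_idx = model.predict(image, caption)   # hypothetical API
        all_boxes.append(boxes)
        all_scores.append(scores)
        # Map chunk-local label indices back to global category indices
        all_labels.extend(start + i for i in label_idx)
    # Concatenation / score thresholding across chunks omitted for brevity
    return all_boxes, all_scores, all_labels
```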

Dwrety commented 1 year ago

> Are the pseudo labels on CC3M and SBU the same as the ones used in GLIP/GLIPv2, or did you generate them yourselves? To my knowledge, Microsoft hasn't yet released their pseudo labels. Will you be releasing them?
>
> Thanks for your questions. We use the GLIP-annotated data for training.

Do you know where to download it? I don't think they have released it publicly.

YifanXu74 commented 1 year ago
> 2. (a) We only use masks for detection data that have annotations. For other data, we do not separate phrases in one sentence. (b) We clip the sentence to ensure it is not too long hahahaha. We have not tried any other sampling strategies. It would be great if you would like to work with us to improve the model.
> 3. The loss function is similar to GLIP. More specifically, it is a focal loss.
> 4. That is a good question as well. Most of these models are for representation learning; we can simply add a Grounding DINO decoder to them for open-set detection.

@SlongLiu Very nice work! According to your answer to question 3, the loss function is similar to GLIP. I notice that GLIP assigns the negative category (background) to the last token of the sentence (maybe the [EOS] token). Am I right? Does Grounding DINO use the same strategy?
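
For context while this question is open: a minimal sketch of the GLIP-style convention being asked about, where the last text token (e.g. [EOS]) stands in for the background/no-object class so that unmatched queries still have a positive target column. Whether Grounding DINO adopts exactly this scheme is precisely what the question asks, so treat this purely as an illustration of the GLIP side.

```python
# Illustration only: GLIP-style targets where the background class is mapped
# to the last text token (e.g. [EOS]); not confirmed for Grounding DINO.
import torch

num_queries, num_text_tokens = 900, 16
background_col = num_text_tokens - 1          # last token acts as "no object"
matched = {0: 3, 7: 5}                        # query index -> matched text token (toy values)

targets = torch.zeros(num_queries, num_text_tokens)
for q in range(num_queries):
    targets[q, matched.get(q, background_col)] = 1.0  # unmatched queries point to [EOS]
```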