Dwrety opened this issue 1 year ago:

To the Authors,

This is a very interesting and well-executed work on visual grounding tasks with a query-based detector. The paper is also well written and clear. Super interesting results with GLIGEN as well. I do have a few very specific questions about the implementation or concepts in the paper.

Thank you.
Thanks for your questions. We will provide the demo with GLIGEN soon.
- (a) We only use masks for detection data that have annotations. For other data, we do not separate phrases within one sentence. (b) We clip the sentence to ensure it is not too long. We have not tried any other sampling strategies. It would be great if you would like to work with us to improve the model.
- The loss function is similar to GLIP's; more specifically, it is focal loss (a generic sketch is included after this list for reference).
- That is a good question as well. Most of these models are for representation learning; we can simply add a Grounding DINO decoder to them for open-set detection.
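For readers unfamiliar with it, below is a minimal sketch of the standard sigmoid focal loss (Lin et al.). It is illustrative only; the function name and defaults are assumptions, not the Grounding DINO repository's actual implementation.

```python
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Standard binary focal loss over classification logits.

    logits, targets: tensors of the same shape; targets are 0/1.
    Illustrative sketch only -- not Grounding DINO's exact implementation.
    """
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)   # probability of the true class
    loss = ce * ((1 - p_t) ** gamma)                     # down-weight easy examples
    if alpha >= 0:
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss
    return loss.mean()
```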
Thanks for your answer; I'm still looking at your code. I think this work is amazing.
Are the pseudo labels on CC3M and SBU the same as the ones used in GLIP/GLIPv2? Or did you generate them yourselves? To my knowledge, Microsoft hasn't yet released their pseudo labels. Will you be releasing them?
Nice work! According to your answer to question 2(b), when evaluating the model on the LVIS dataset, do you mean that you concatenate a subset of the category names so the prompt stays within the 256-text-token limit, forward each image multiple times with different subsets, and then merge the results into the final detections over the 1000+ LVIS categories? Thanks!
I am not the author, but I think they followed GLIP, which uses the same inference process as you described.
> Are the pseudo labels on CC3M and SBU the same as the ones used in GLIP/GLIPv2? Or did you generate them yourselves? To my knowledge, Microsoft hasn't yet released their pseudo labels. Will you be releasing them?
Thanks for your questions. We use the GLIP-annotated data for training.
> I am not the author, but I think they followed GLIP, which uses the same inference process as you described.
Yes, we follow GLIP. We first run the model multiple times with different category names and then merge the outputs.
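For anyone implementing this, here is a rough sketch of that chunked-inference procedure. The helper names, the `model(image, caption)` interface, and the tokenizer call are assumptions for illustration, not the repository's actual API.

```python
def chunk_categories(names, tokenizer, max_tokens=256):
    """Greedily pack category names into prompts that stay under the token budget."""
    chunks, current, current_len = [], [], 0
    for name in names:
        n_tok = len(tokenizer.tokenize(name)) + 1          # +1 for the " . " separator
        if current and current_len + n_tok > max_tokens:
            chunks.append(current)
            current, current_len = [], 0
        current.append(name)
        current_len += n_tok
    if current:
        chunks.append(current)
    return chunks


def detect_large_vocab(image, names, model, tokenizer):
    """Run the detector once per chunk of category names and merge all detections."""
    merged = []
    for chunk in chunk_categories(names, tokenizer):
        caption = " . ".join(chunk) + " ."                  # GLIP/Grounding DINO-style prompt
        boxes, scores, labels = model(image, caption)       # hypothetical interface
        # map chunk-local label indices back to global category names
        merged.extend((b, s, chunk[l]) for b, s, l in zip(boxes, scores, labels))
    return merged
```

With a score threshold or per-category top-k applied afterwards, the merged list is what gets evaluated against the 1000+ LVIS categories.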
> Thanks for your questions. We use the GLIP-annotated data for training.
Do you know where to download it? I don't think they released it publicly.
@SlongLiu Very nice work! According to your answer to question 3, the loss function is similar to GLIP's. I notice that GLIP assigns the negative category (background) to the last token of the sentence (maybe the [EOS] token). Am I right? Does Grounding DINO use the same strategy?
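(Also not the author.) To make the premise of this question concrete, here is a toy illustration of the assignment scheme it describes, where unmatched (background) queries target only the last text token. The helper below is purely hypothetical and is not verified against GLIP's or Grounding DINO's code.

```python
import torch

def build_alignment_targets(num_queries, num_tokens, matches):
    """Build a (num_queries, num_tokens) 0/1 query-to-token target matrix.

    `matches` maps a matched query index to the token indices of its phrase.
    Under the scheme described in the question above, unmatched (background)
    queries target only the last token (e.g. [EOS]).  Purely illustrative.
    """
    targets = torch.zeros(num_queries, num_tokens)
    for q in range(num_queries):
        if q in matches:
            targets[q, matches[q]] = 1.0         # positive query: tokens of its phrase
        else:
            targets[q, num_tokens - 1] = 1.0     # background query: last ([EOS]) token
    return targets

# e.g. 4 queries and 6 text tokens; queries 0 and 2 are matched to phrases
targets = build_alignment_targets(4, 6, {0: [1, 2], 2: [4]})
```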