IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

dense similarity predictions #60

Open caoyunkang opened 1 year ago

caoyunkang commented 1 year ago

Hi! Thanks for your awesome work. I am wondering whether it is possible to extract dense similarity scores between an input image and textual prompts. Specifically, I tried to compute dense similarity with the following pseudo-code, using the text features and image features taken after the Feature Enhancer. However, the resulting similarity is nearly meaningless. Do you have any other suggestions? Dense similarity is vital for several open-world tasks.

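import torch.nn.functional as F

# L2-normalize along the channel dimension, then take pairwise cosine similarity
enhanced_image_features = F.normalize(enhanced_image_features, dim=-1)
enhanced_text_features = F.normalize(enhanced_text_features, dim=-1)
similarity = enhanced_image_features @ enhanced_text_features.transpose(-1, -2)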

caoyunkang commented 1 year ago

I also tried using topk_logits from transformer.py. Unfortunately, the result is still meaningless. Ideally, it should produce something like a heatmap. Do you have any suggestions?

topk_logits = enc_outputs_class_unselected.max(-1)[0]
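
One thing that may be worth checking before judging the heatmap: a per-token score vector like topk_logits is usually flat over all feature levels, so it has to be split per level and reshaped before it looks like an image. The sketch below assumes a Deformable-DETR-style layout (levels concatenated in order, each level flattened row-major) and that the values are raw logits, so it applies a sigmoid; drop that step if they are already probabilities. `split_into_level_maps`, `fuse_heatmaps`, the example `spatial_shapes`, and the random stand-in tensor are all hypothetical and not taken from the repository.

```python
import torch
import torch.nn.functional as F


def split_into_level_maps(flat_scores, spatial_shapes):
    """Split (B, sum(H_l * W_l)) scores into one (B, H_l, W_l) map per level.

    Assumes levels are concatenated in order and flattened row-major.
    """
    sizes = [h * w for h, w in spatial_shapes]
    if flat_scores.shape[-1] != sum(sizes):
        raise ValueError("token count does not match spatial_shapes")
    chunks = flat_scores.split(sizes, dim=-1)
    return [c.reshape(-1, h, w) for c, (h, w) in zip(chunks, spatial_shapes)]


def fuse_heatmaps(level_maps, out_size):
    """Bilinearly resize every level to out_size and average them.

    Averaging is just one simple fusion choice, not the model's own method.
    """
    resized = [
        F.interpolate(m[:, None], size=out_size, mode="bilinear", align_corners=False)[:, 0]
        for m in level_maps
    ]
    return torch.stack(resized).mean(dim=0)


# Stand-in data: three hypothetical feature levels and random per-token logits.
spatial_shapes = [(8, 12), (4, 6), (2, 3)]
topk_logits = torch.randn(1, sum(h * w for h, w in spatial_shapes))

# Assumption: values are raw logits, so squash them into (0, 1) first.
level_maps = [m.sigmoid() for m in split_into_level_maps(topk_logits, spatial_shapes)]
heatmap = fuse_heatmaps(level_maps, out_size=(64, 96))
print(heatmap.shape)  # torch.Size([1, 64, 96])
```

If the real per-level shapes are available in the forward pass, passing those instead of the made-up `spatial_shapes` should give one map per backbone scale.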

Dwrety commented 1 year ago

In the paper, they don't normalize the features. Maybe that's the reason? I actually posted the first issue asking about this, so it is one thing to consider. The topk_logits are already a heatmap in (0, 1); each query is basically matched against 256 text tokens. However, the model is trained with focal loss, so a query may get high responses for multiple text tokens. Inference basically takes the highest response.
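
A minimal sketch of the unnormalized variant described above, in case it helps with comparing the two approaches: raw dot products between image tokens and text tokens, a sigmoid on top, and then the highest response over the text tokens. The function name `dense_token_scores`, the tensor shapes, the boolean `text_token_mask`, and the random stand-in features are assumptions for illustration, not the repository's API. For comparison, it also computes the cosine version: cosine similarity lies in [-1, 1], so a sigmoid on top of it can only range over roughly [0.27, 0.73], which by itself would flatten any heatmap.

```python
import torch
import torch.nn.functional as F


def dense_token_scores(image_feats, text_feats, text_token_mask):
    """Per-location, per-text-token scores from raw (unnormalized) dot products.

    image_feats:     (B, N, C) flattened image tokens
    text_feats:      (B, T, C) text tokens
    text_token_mask: (B, T) bool, True for real tokens, False for padding
    """
    logits = image_feats @ text_feats.transpose(-1, -2)  # (B, N, T)
    # Padded text tokens should never win the max, so push them to -inf.
    logits = logits.masked_fill(~text_token_mask[:, None, :], float("-inf"))
    probs = logits.sigmoid()  # sigmoid(-inf) == 0 for padded tokens
    # Highest response over text tokens, and which token produced it.
    best_prob, best_token = probs.max(dim=-1)  # both (B, N)
    return probs, best_prob, best_token


# Random stand-in features; real ones would come from the model.
torch.manual_seed(0)
image_feats = torch.randn(1, 20, 256)
text_feats = torch.randn(1, 6, 256)
text_token_mask = torch.tensor([[True, True, True, True, False, False]])

probs, best_prob, best_token = dense_token_scores(image_feats, text_feats, text_token_mask)
print(probs.shape, best_prob.shape)  # torch.Size([1, 20, 6]) torch.Size([1, 20])

# Cosine variant for comparison: bounded in [-1, 1] before the sigmoid.
cos = F.normalize(image_feats, dim=-1) @ F.normalize(text_feats, dim=-1).transpose(-1, -2)
cos_probs = cos.sigmoid()  # confined to about [0.27, 0.73]
```

Because focal loss scores each text token independently, `probs[..., t]` for a single token index `t` may be a more useful heatmap for one phrase than the max over all tokens.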