IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

About feature enhancer architecture #214

Status: Open · DianCh opened this issue 1 year ago

DianCh commented 1 year ago

Hi, I'm looking at groundingdino/models/GroundingDINO/transformer.py. From lines 545-593 it looks like the order of modules is Bi-Direction MHA (text->image, image->text) -> text self-attention and image deformable self-attention, which is different from the order depicted in the main figure:

[Screenshot: the feature enhancer block from the paper's main architecture figure]

which has text self-attention and image deformable self-attention before the fusion. Can I ask why?
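For reference, here is a paraphrased sketch of the per-layer flow I'm reading in that range (identifiers simplified and arguments trimmed; not the exact source):

```python
# Paraphrased sketch of the encoder loop in transformer.py (~lines 545-593).
# Names and signatures are simplified for illustration only.
for layer_id, layer in enumerate(self.layers):
    # 1) Bi-directional cross-modality fusion comes first
    #    (text -> image and image -> text attention)
    if self.fusion_layers:
        output, memory_text = self.fusion_layers[layer_id](
            v=output,        # image features
            l=memory_text,   # text features
        )
    # 2) Then text self-attention
    if self.text_layers:
        memory_text = self.text_layers[layer_id](memory_text)
    # 3) Then image deformable self-attention
    output = layer(src=output, reference_points=reference_points)
```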

Thank you.

EddieEduardo commented 11 months ago

Same confusion

xiexie123 commented 4 months ago

+1

eeeric-code commented 2 months ago

same question