Hi, I'm looking at groundingdino/models/GroundingDINO/transformer.py. In lines 545-593, the order of modules appears to be Bi-Direction MHA (text->image, image->text) -> text self-attention and image deformable self-attention, which differs from the order depicted in the main figure, where text self-attention and image deformable self-attention come before the fusion. Can I ask why?
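For reference, here is my reading of the per-layer order in those lines, as a minimal sketch. The names fusion_layer, text_layer, and image_layer are placeholders I chose for illustration, not the exact attribute names in transformer.py:

```python
from typing import Callable, Tuple
from torch import Tensor

def encoder_layer_forward(
    fusion_layer: Callable,  # placeholder: bi-directional MHA (text->image, image->text)
    text_layer: Callable,    # placeholder: vanilla self-attention over text tokens
    image_layer: Callable,   # placeholder: deformable self-attention over image tokens
    img_feat: Tensor,
    text_feat: Tensor,
) -> Tuple[Tensor, Tensor]:
    # 1) Fusion runs first, so both streams become cross-modal
    img_feat, text_feat = fusion_layer(img_feat, text_feat)
    # 2) Text self-attention then operates on the already-fused text features
    text_feat = text_layer(text_feat)
    # 3) Image deformable self-attention then operates on the already-fused image features
    img_feat = image_layer(img_feat)
    return img_feat, text_feat
```

In the main figure, steps 2 and 3 appear to happen before step 1, which is the discrepancy I'm asking about.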
Thank you.