IDEA-Research / T-Rex

[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/blog/T-Rex

About Visual Prompt Encoder. #83

Open fuweifu-vtoo opened 3 months ago

fuweifu-vtoo commented 3 months ago

Dear author, I have another question for you:

In the Visual Prompt Encoder, does it stack three deformable cross-attention layers, followed by a single self-attention layer and one FFN?

Or does it stack three blocks of (deformable cross-attention + self-attention + FFN)?

Mountchicken commented 3 months ago

three blocks of (Deformable cross attention + self attention + FFN)

pisiguiii commented 2 months ago

Hi @Mountchicken

Previously you referred to code from Grounding DINO (https://github.com/IDEA-Research/T-Rex/issues/85#issuecomment-2363020703), specifically the DeformableTransformerDecoderLayer class. I would like to clarify: when you mention "deformable cross-attention", do you mean the whole DeformableTransformerDecoderLayer, or only the self.cross_attn module from this class?

If I understood correctly, then DeformableTransformerDecoderLayer == (deformable cross-attention + self-attention + FFN). Am I right in my conclusion?

Mountchicken commented 2 months ago

The visual prompt encoder consists of several DeformableTransformerDecoderLayers.
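To summarize the thread, the structure confirmed above can be sketched as three stacked blocks, each doing (deformable cross-attention + self-attention + FFN). The sketch below is illustrative only: `nn.MultiheadAttention` is used as a stand-in for the actual deformable cross-attention in Grounding DINO's `DeformableTransformerDecoderLayer` (which additionally takes reference points and multi-scale spatial shapes), and the class name, dimensions, and layer order are assumptions, not the repo's code.

```python
import torch
import torch.nn as nn


class VisualPromptEncoderBlock(nn.Module):
    """One illustrative block: cross-attention + self-attention + FFN.

    NOTE: nn.MultiheadAttention is a stand-in for deformable
    cross-attention; the real module also consumes reference points
    and spatial shapes. Names/dims here are hypothetical.
    """

    def __init__(self, d_model=256, n_heads=8, d_ffn=1024):
        super().__init__()
        # Stand-in for deformable cross-attention: prompt queries attend
        # to flattened image features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Self-attention among the visual prompt queries.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model)
        )
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, queries, memory):
        q = self.norm1(queries + self.cross_attn(queries, memory, memory)[0])
        q = self.norm2(q + self.self_attn(q, q, q)[0])
        return self.norm3(q + self.ffn(q))


# Three stacked blocks, per the answer above.
blocks = nn.ModuleList(VisualPromptEncoderBlock() for _ in range(3))

queries = torch.randn(2, 4, 256)    # (batch, num_visual_prompts, d_model)
memory = torch.randn(2, 100, 256)   # flattened multi-scale image features
for block in blocks:
    queries = block(queries, memory)
print(queries.shape)  # torch.Size([2, 4, 256])
```

Each block keeps the query shape unchanged, so the prompt embeddings can be refined repeatedly against the image features before being used downstream.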