IDEA-Research / T-Rex

[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/blog/T-Rex

About Visual Prompt Encoder. #83

Open fuweifu-vtoo opened 3 months ago

fuweifu-vtoo commented 3 months ago

Dear author, I have another question for you:

In the Visual Prompt Encoder, does it stack three deformable cross-attention layers, followed by a single self-attention layer and one FFN?

Or does it stack three blocks of (deformable cross-attention + self-attention + FFN)?

Mountchicken commented 3 months ago

three blocks of (Deformable cross attention + self attention + FFN)

pisiguiii commented 2 months ago

Hi @Mountchicken

Previously you referred to code from Grounding DINO (https://github.com/IDEA-Research/T-Rex/issues/85#issuecomment-2363020703), specifically the DeformableTransformerDecoderLayer class. I would like to clarify: when you mention "deformable cross-attention", do you mean the whole DeformableTransformerDecoderLayer, or only the self.cross_attn module from this class?

If I understood correctly, then DeformableTransformerDecoderLayer == (deformable cross-attention + self-attention + FFN). Am I right in my conclusion?

Mountchicken commented 2 months ago

The visual prompt encoder consists of several DeformableTransformerDecoderLayers.
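To summarize the thread, the structure confirmed above can be sketched as three stacked blocks, each doing (deformable cross-attention + self-attention + FFN). The sketch below is illustrative only: `nn.MultiheadAttention` is used as a stand-in for the actual deformable cross-attention in Grounding DINO's `DeformableTransformerDecoderLayer` (which additionally takes reference points and multi-scale spatial shapes), and the class name, dimensions, and layer order are assumptions, not the repo's code.

```python
import torch
import torch.nn as nn


class VisualPromptEncoderBlock(nn.Module):
    """One illustrative block: cross-attention + self-attention + FFN.

    NOTE: nn.MultiheadAttention is a stand-in for deformable
    cross-attention; the real module also consumes reference points
    and spatial shapes. Names/dims here are hypothetical.
    """

    def __init__(self, d_model=256, n_heads=8, d_ffn=1024):
        super().__init__()
        # Stand-in for deformable cross-attention: prompt queries attend
        # to flattened image features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Self-attention among the visual prompt queries.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Feed-forward network.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model)
        )
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, queries, memory):
        q = self.norm1(queries + self.cross_attn(queries, memory, memory)[0])
        q = self.norm2(q + self.self_attn(q, q, q)[0])
        return self.norm3(q + self.ffn(q))


# Three stacked blocks, per the answer above.
blocks = nn.ModuleList(VisualPromptEncoderBlock() for _ in range(3))

queries = torch.randn(2, 4, 256)    # (batch, num_visual_prompts, d_model)
memory = torch.randn(2, 100, 256)   # flattened multi-scale image features
for block in blocks:
    queries = block(queries, memory)
print(queries.shape)  # torch.Size([2, 4, 256])
```

Each block keeps the query shape unchanged, so the prompt embeddings can be refined repeatedly against the image features before being used downstream.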