three blocks of (Deformable cross attention + self attention + FFN)
Open · fuweifu-vtoo opened 3 months ago
Hi @Mountchicken
Previously, in https://github.com/IDEA-Research/T-Rex/issues/85#issuecomment-2363020703, you referred to code from Grounding DINO, specifically the DeformableTransformerDecoderLayer class.
I would like to clarify: when you mention "Deformable cross attention", do you mean the whole DeformableTransformerDecoderLayer, or only the self.cross_attn module from that class?
If I understood correctly, then
DeformableTransformerDecoderLayer == (Deformable cross attention + self attention + FFN),
and the visual prompt encoder consists of several DeformableTransformerDecoderLayers.
Am I right in these conclusions?
Dear author, I have another question for you:
In the Visual Prompt Encoder, does it stack three deformable cross-attention layers and then attach a single self-attention layer and a single FFN?
Or does it stack three full blocks of (Deformable cross attention + self attention + FFN)?
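To make the two readings concrete, here is a minimal PyTorch sketch of both. This is only an illustration of the layer ordering I am asking about, not the actual T-Rex code: `nn.MultiheadAttention` stands in for the real deformable attention (`MSDeformAttn`), and all class names, dimensions, and the residual/FFN details are assumptions.

```python
import torch
import torch.nn as nn


def make_ffn(d_model):
    # Simple feed-forward network; 4x expansion is an assumption.
    return nn.Sequential(
        nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
    )


class OptionA(nn.Module):
    """Reading 1: three cross-attention layers, then ONE self-attention + ONE FFN."""

    def __init__(self, d_model=256, n_heads=8, num_cross=3):
        super().__init__()
        # nn.MultiheadAttention is a stand-in for deformable cross attention.
        self.cross_layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(num_cross)
        )
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = make_ffn(d_model)

    def forward(self, queries, memory):
        for cross in self.cross_layers:
            queries = queries + cross(queries, memory, memory)[0]
        queries = queries + self.self_attn(queries, queries, queries)[0]
        return queries + self.ffn(queries)


class Block(nn.Module):
    """One (cross attention + self attention + FFN) block, i.e. roughly what
    DeformableTransformerDecoderLayer would contain under my interpretation."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = make_ffn(d_model)

    def forward(self, queries, memory):
        queries = queries + self.cross_attn(queries, memory, memory)[0]
        queries = queries + self.self_attn(queries, queries, queries)[0]
        return queries + self.ffn(queries)


class OptionB(nn.Module):
    """Reading 2: three full blocks of (cross attention + self attention + FFN)."""

    def __init__(self, num_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(Block() for _ in range(num_blocks))

    def forward(self, queries, memory):
        for block in self.blocks:
            queries = block(queries, memory)
        return queries
```

In both sketches `queries` would be the visual prompt queries and `memory` the image features they attend to; the difference is only whether self-attention and the FFN run once at the end (OptionA) or inside every block (OptionB).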