IDEA-Research / T-Rex

[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/blog/T-Rex

Model's architecture visualization #78

Closed VilisovEvgeny closed 3 months ago

VilisovEvgeny commented 4 months ago

Hi, I'm trying to understand the connections between the modules in this image, and I have some questions:

  1. Do I understand correctly that in Block 3 the CAT operation is performed between the positional embeddings (which contain only the box coordinates) and [C, C'] (where C is the gray boxes and C' is the box with gray lines)?
  2. Where does Block 3 get C and C' from?
  3. Is the Aggregator in Block 3 the same as C'? Where does it go then?
  4. Does Block 3 contain any information from the input image or from its feature map?
  5. In Block 4, are the black arrows pointing to the prompt regions? If so, do they correspond to bi and f in Q' = MSDeformable()?
  6. On page 5 of the paper, the Box Decoder section says: "Subsequently, the detection queries utilize deformable cross-attention [55] to focus on the encoded multi-scale image features and are used to predict anchor offsets (∆x, ∆y, ∆w, ∆h) at each decoder layer". Is the deformable cross-attention mentioned there the one in Block 4, or is it a different deformable cross-attention implemented in the Block 5 DETR decoder?
  7. In Block 5, where do the detection queries come from? From reading the paper, I understand that they come from inside the DETR Decoder, but in the image it looks like they come from outside and are passed to DETR.

Thanks!

[Image: trex2model_moduls]

Mountchicken commented 4 months ago

  1. Yes.
  2. C and C' are two learnable parameters, initialized from nn.Embedding(2, channel_dimension).
  3. C' acts as a [CLS] token, like the one in CLIP. It aggregates information from the other C tokens through self-attention, and we take its output from the transformer as the final visual prompt embedding.
  4. Block 3 is independent of the image.
  5. Yes.
  6. We adopt the decoder from DINO as the DETR Decoder, which is composed of deformable attention.
  7. The detection queries in the DETR Decoder are initialized from nn.Embedding(num_queries, channel_dimension). You can refer to DINO or DAB-DETR for more details.
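For readers who find code easier to follow than the diagram, here is a rough PyTorch sketch of answers 2, 3 and 7. It is not the T-Rex2 implementation: the class name `VisualPromptAggregator`, the `nn.Linear` stand-in for the box positional embedding, the channel-wise concatenation followed by a projection, the whole-image box paired with C', and the sizes (256 channels, 900 queries) are all assumptions made for illustration. The deformable cross-attention of Block 4 is left out, so only the image-independent part and the [CLS]-style self-attention step are shown.

```python
import torch
from torch import nn


class VisualPromptAggregator(nn.Module):
    """Hypothetical sketch of the [CLS]-style aggregation described above."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Index 0 -> C (content embedding shared by every prompt box),
        # index 1 -> C' (the aggregator token), i.e. nn.Embedding(2, dim).
        self.content = nn.Embedding(2, dim)
        # Stand-in for the box positional embedding (assumption, not the
        # encoding used by the real model).
        self.box_embed = nn.Linear(4, dim)
        # Projects the concatenated [content, position] features back to dim.
        self.fuse = nn.Linear(2 * dim, dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (batch, k, 4), normalized (cx, cy, w, h) of the prompt boxes.
        batch, k, _ = boxes.shape
        # Assumption: the aggregator token is paired with a whole-image box.
        global_box = boxes.new_tensor([0.5, 0.5, 1.0, 1.0]).expand(batch, 1, 4)
        all_boxes = torch.cat([boxes, global_box], dim=1)  # (batch, k + 1, 4)

        # k copies of C followed by a single C'.
        idx = torch.tensor([0] * k + [1], device=boxes.device)
        content = self.content(idx).unsqueeze(0).expand(batch, -1, -1)

        # CAT along the channel dimension, then project back to dim.
        tokens = self.fuse(torch.cat([content, self.box_embed(all_boxes)], dim=-1))
        # Self-attention lets C' gather information from the k prompt tokens.
        attended, _ = self.self_attn(tokens, tokens, tokens)
        # The output at the C' position is used as the visual prompt embedding.
        return attended[:, -1]


# Detection queries are likewise a learnable table (900 is illustrative).
detection_queries = nn.Embedding(900, 256)

aggregator = VisualPromptAggregator()
prompt_embedding = aggregator(torch.rand(2, 3, 4))  # 2 images, 3 prompt boxes each
```

Note that `forward` takes only box coordinates, which matches answer 4: nothing from the image or its feature map enters this part. In the full model, the image features would come in through the deformable cross-attention that this sketch omits.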

VilisovEvgeny commented 3 months ago

Now it's clear. Thanks!