1: Yes
2: C and C' are two learnable parameters, initialized from nn.Embedding(2, channel_dimension).
3: C' acts as a [CLS] token, like the one in CLIP. It aggregates information from the other C token through self-attention, and we take this transformer output as the final visual prompt embedding (see the sketch after this list).
4: Block 3 is independent of the image.
5: Yes
6: We adopt the decoder from DINO as the DETR decoder; it is composed of deformable attention layers.
7: The detection queries in the DETR decoder are initialized from nn.Embedding(num_queries, channel_dimension). You can refer to DINO or DAB-DETR for more details (see the query sketch after this list).
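To make answers 2–4 concrete, here is a minimal PyTorch sketch of how the two prompt tokens could be wired up. Everything here (the class name, layer count, head count, dimensions) is illustrative and not the repo's actual implementation:

```python
import torch.nn as nn

class VisualPromptEncoder(nn.Module):
    """Minimal sketch (hypothetical names): C and C' are two learnable
    embeddings; C' acts as a [CLS]-like token that aggregates information
    from C via self-attention, and its output is taken as the final
    visual prompt embedding. Note it never sees the image (answer 4)."""

    def __init__(self, channel_dimension=256, num_layers=2, num_heads=8):
        super().__init__()
        # Two learnable tokens: index 0 -> C, index 1 -> C'.
        self.prompt_tokens = nn.Embedding(2, channel_dimension)
        layer = nn.TransformerEncoderLayer(
            d_model=channel_dimension, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, batch_size):
        # (batch, 2, channel_dimension), shared across the batch.
        tokens = self.prompt_tokens.weight.unsqueeze(0).expand(batch_size, -1, -1)
        out = self.encoder(tokens)
        # Take the output at C''s position as the final prompt embedding.
        return out[:, 1]  # (batch, channel_dimension)
```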
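And a minimal sketch of the detection-query initialization from answer 7. The num_queries and channel_dimension values are placeholders, and the DINO-style deformable decoder that would consume these queries is omitted:

```python
import torch.nn as nn

# Detection queries are learnable embeddings, as in DINO / DAB-DETR.
num_queries, channel_dimension = 900, 256
detection_queries = nn.Embedding(num_queries, channel_dimension)

batch_size = 2
# Broadcast the learned queries across the batch before the decoder:
queries = detection_queries.weight.unsqueeze(0).expand(batch_size, -1, -1)
print(queries.shape)  # torch.Size([2, 900, 256])
```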
Now it's clear. Thanks!
Hi, I'm trying to understand the connections between the modules in this image, and I have some questions:
"Subsequently, the detection queries utilize deformable cross-attention [55] to focus on the encoded multi-scale image features and are used to predict anchor offsets (∆x, ∆y, ∆w, ∆h) at each decoder layer"
Is the deformable cross-attention mentioned here the one from Block 4, or is it another deformable cross-attention implemented in Block 5 (DETR)? Thanks!
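For readers landing here: the per-layer anchor update described in the quoted passage can be sketched as below. This assumes the common DETR-family parameterization, where predicted offsets are added in inverse-sigmoid space; the paper's exact formulation may differ:

```python
import torch

def refine_anchors(anchors, offsets):
    """One refinement step: each decoder layer predicts offsets
    (dx, dy, dw, dh) that update the current anchor boxes.
    anchors, offsets: (batch, num_queries, 4), anchors in (0, 1)."""
    inv_sig = torch.logit(anchors.clamp(1e-4, 1 - 1e-4))
    return torch.sigmoid(inv_sig + offsets)

# Illustrative usage with random tensors:
anchors = torch.rand(2, 900, 4)
offsets = torch.randn(2, 900, 4) * 0.1
new_anchors = refine_anchors(anchors, offsets)
```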