facebookresearch / detr

End-to-End Object Detection with Transformers

Positional Encoding at Encoder and Decoder Stages #504

Open AlexTS1980 opened 2 years ago

AlexTS1980 commented 2 years ago

I still don't quite get how the transformer architecture from LMs translates to DETR:

First of all, I don't understand what exactly the positional encoding encodes. In LMs it's the positions of tokens in a sequence, but what about DETR? My confusion arises from several seemingly incongruent observations:

A. The input image has dimensions 3 × h × w.

B. The feature extractor outputs features of dimensions C × H × W (or 1 × C × H × W with the batch dimension).

C. As I understand it, positional encodings take some indices as input, pass them through a learnable embedding (as in LMs), and output vectors, each with a predefined dimensionality d (I try to sketch this below).

D. This tensor is summed elementwise with the output in B, so the two must have the same dimensionality.

E. At the same time, the encoder output visualized in Figure 3 seems to have dimensions (h*w) × h × w, because a self-attention map is shown for each pixel of the map (h*w in total) at the resolution of the input image (h × w), not H × W. So how was that obtained?

So, roughly speaking, how many positional encodings are there: h*w or H*W?
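For concreteness, here is a minimal sketch of what I understand the sinusoidal variant in `models/position_encoding.py` to do (the padding mask is dropped, and d = 256 and the grid size are my assumptions):

```python
import torch

def sine_position_encoding(H, W, d=256, temperature=10000):
    """2-D sinusoidal positional encoding over an H x W feature grid.

    Returns a tensor of shape (d, H, W): one d-dim vector per feature-map
    cell, i.e. H*W encoded positions. Simplified from DETR's
    PositionEmbeddingSine (padding mask omitted).
    """
    assert d % 4 == 0, "d must split evenly into y/x sin/cos parts"
    num_pos_feats = d // 2  # half the channels encode y, half encode x

    y_embed = torch.arange(1, H + 1, dtype=torch.float32).view(H, 1).expand(H, W)
    x_embed = torch.arange(1, W + 1, dtype=torch.float32).view(1, W).expand(H, W)

    dim_t = torch.arange(num_pos_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_pos_feats)

    pos_x = x_embed[:, :, None] / dim_t  # (H, W, num_pos_feats)
    pos_y = y_embed[:, :, None] / dim_t
    # interleave sin/cos along the channel dimension
    pos_x = torch.stack((pos_x[..., 0::2].sin(), pos_x[..., 1::2].cos()), dim=3).flatten(2)
    pos_y = torch.stack((pos_y[..., 0::2].sin(), pos_y[..., 1::2].cos()), dim=3).flatten(2)
    return torch.cat((pos_y, pos_x), dim=2).permute(2, 0, 1)  # (d, H, W)

# e.g. an 800 x 1056 image through a stride-32 backbone gives H=25, W=33
pos = sine_position_encoding(H=25, W=33)
print(pos.shape)  # torch.Size([256, 25, 33]) -> H*W = 825 positions, not h*w
```

If this is right, the answer would be H*W (one d-dim vector per feature-map cell), and the h × w maps in Figure 3 would just be attention weights rendered at image resolution, but I'd like confirmation.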

Now, if that wasn't enough, both the output of the encoder and a set of object queries are inputs to the decoder. If, again, my understanding is roughly correct, there are N indices for the object queries, each is also passed through some kind of embedding, and then somehow combined with the encoder outputs to produce the final object queries of some predefined dimensionality. Again, looking at Figure 6, the output of the decoder seems to have dimensions N × h × w, not N × H × W.
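To make my mental model concrete, here is a minimal sketch using a plain `nn.TransformerDecoder` (N = 100 and d = 256 are the paper's defaults; everything else, including feeding the query embedding in only once, is my simplification, since DETR's actual decoder re-adds it inside every layer):

```python
import torch
import torch.nn as nn

N, d = 100, 256                        # number of object queries, model width
H, W = 25, 33                          # encoder grid (feature-map cells, not pixels)

query_embed = nn.Embedding(N, d)       # N learned queries, one d-dim vector each
memory = torch.randn(H * W, 1, d)      # encoder output: a sequence of H*W tokens

decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# the queries cross-attend to the encoder memory; they are not concatenated with it
tgt = query_embed.weight.unsqueeze(1)  # (N, 1, d)
out = decoder(tgt, memory)             # (N, 1, d): one output vector per query
print(out.shape)                       # torch.Size([100, 1, 256])

# each query's cross-attention weights span the H*W memory tokens, so a map
# reshaped to (H, W) is presumably what Figure 6 shows upsampled to h x w
```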

So what am I doing wrong?

Shar-01 commented 1 year ago

@AlexTS1980 I am also trying to understand how the positional encoding at the encoder stage is implemented such that the dimensions are compatible, and what exactly the encoding represents with respect to the input images. Did you happen to work something out here?
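Concretely, the shape flow I am trying to confirm looks like this (a sketch only; hidden_dim = 256 follows the paper, C = 2048 assumes a ResNet-50 backbone with stride 32, and DETR actually hands the positional encoding to every attention layer rather than summing it once up front):

```python
import torch
import torch.nn as nn

B, C, d = 1, 2048, 256
H, W = 25, 33                                # h/32 x w/32 for an 800 x 1056 image

feat = torch.randn(B, C, H, W)               # backbone output: C x H x W
input_proj = nn.Conv2d(C, d, kernel_size=1)  # 1x1 conv reduces C channels to d

src = input_proj(feat)                       # (B, d, H, W)
pos = torch.randn(B, d, H, W)                # positional encoding, same shape as src
src = src + pos                              # elementwise sum is now well-defined

tokens = src.flatten(2).permute(2, 0, 1)     # (H*W, B, d): H*W tokens of width d
print(tokens.shape)                          # torch.Size([825, 1, 256])
```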