We first flatten the patch embeddings into a 1D sequence. To encode the spatial information of the patches, we learn a specific position embedding for every location, which is then added to the patch embedding to form the final 1D sequence input.
I am just wondering whether your position embedding is learned by the network, or built from sine and cosine functions as in the paper "Attention Is All You Need". Since the position embedding is important, I think it is necessary to give a brief description of how to implement it. The paper "End-to-End Video Instance Segmentation with Transformers" also uses a transformer for segmentation; it describes how the position information is embedded and uses experiments to demonstrate the effectiveness of the position embedding. So I think it would be better to give a brief description of the implementation of the position embedding.
Yes, I clearly mentioned above that
we *learn* a specific position embedding
I have the same question. Does "learn a specific position embedding" mean using a single linear (FC) layer that maps the original absolute coordinates (of size (L, 2)) to size (L, C)? Also, why use addition instead of concatenation to fuse the position embedding and the image embedding?
No. There is no such thing as an (L, 2) input. We directly learn position embeddings of size (L, C).
Thanks for your help. Do you mean the position embeddings are also mapped from the original image patches, i.e. (L, 256) → (L, 1024), just like the patch embedding, instead of from the original pixel coordinates? Would it be better to map the position embedding from the absolute coordinates (L, 2)?
There is no mapping in the position embedding; we directly learn the position embedding with size (L, C). You might want to take a quick tutorial on the Transformer.
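For anyone still confused, here is a minimal sketch of what "directly learn a position embedding with size (L, C)" could look like in PyTorch. This is not the released code; the class name, the conv-based patch projection, and the hyperparameters (image size, patch size, embedding dim) are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class PatchEmbedWithLearnedPos(nn.Module):
    """Sketch: flatten image patches to a 1D sequence of length L and add a
    freely learned position embedding of shape (L, C). Hyperparameters are
    illustrative, not the paper's actual settings."""

    def __init__(self, img_size=256, patch_size=16, in_chans=3, embed_dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # L
        # Conv with kernel = stride = patch_size is equivalent to a linear
        # projection of each flattened patch to C dimensions.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # The position embedding is a learned parameter of shape (L, C).
        # It is NOT computed from pixel coordinates or sin/cos functions.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, x):                 # x: (B, 3, H, W)
        x = self.proj(x)                  # (B, C, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)  # (B, L, C)
        return x + self.pos_embed         # add, not concatenate
```

In this reading, each flattened patch index gets its own learned row of the (L, C) table, so the 2D location of a patch is absorbed during training rather than encoded explicitly.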
I have a question: the original image is divided into patches and the position encoding is defined per patch, but what about the positions inside each patch? Could this cause problems for dense prediction tasks like semantic segmentation?
Since the patches come from a 2D image, the position information has two directions, i.e. x-axis and y-axis indices. This is different from the 1D sequence case. How do you implement the position embedding? Can you share the details, since the code is not released?