fudan-zvg / SETR

[CVPR 2021] Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
MIT License

About the position embeddings for patches #2

Closed · xsola closed this 3 years ago

xsola commented 3 years ago

Since the patches come from a 2D image, the position information has two directions, i.e., x-axis and y-axis indices. This is different from the 1-D sequence case. How do you implement the position embedding? Can you share the details, since the code is not released?

lzrobots commented 3 years ago

We first flatten the patch embeddings into a 1D sequence. To encode the spatial information of the patches, we learn a specific position embedding for every location, which is then added to the corresponding patch embedding to form the final 1D sequence input.
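A minimal PyTorch sketch of that scheme, assuming a 256x256 input, 16x16 patches, and hidden size 1024 (the class and variable names are illustrative, not SETR's exact code):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch: flatten an image into patch tokens and add a learned
    position embedding per location (illustrative sizes and names)."""

    def __init__(self, img_size=256, patch_size=16, in_chans=3, embed_dim=1024):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2  # L = 256 patch locations
        # Linear projection of each flattened patch, implemented as a strided conv.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learned vector per patch location, shape (1, L, C).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):
        # (B, 3, H, W) -> (B, C, H/16, W/16) -> (B, L, C)
        x = self.proj(x).flatten(2).transpose(1, 2)
        # The position embedding is added (not concatenated) to each token.
        return x + self.pos_embed
```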

xsola commented 3 years ago

I am just wondering whether your position embedding is learned by the network or uses the sine and cosine functions from the paper "Attention Is All You Need". Since the position embedding is important, I think a brief description of how it is implemented is necessary. The paper "End-to-End Video Instance Segmentation with Transformers" also uses a transformer for segmentation; it describes how position information is embedded and uses experiments to demonstrate the effectiveness of the position embedding. So it may be better to give a brief description of the implementation of the position embedding.
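For reference, the fixed sinusoidal alternative being asked about looks roughly like this (a sketch of the "Attention Is All You Need" formulation, not what this repo uses):

```python
import math
import torch

def sinusoidal_position_embedding(num_positions: int, dim: int) -> torch.Tensor:
    """Fixed sine/cosine table from 'Attention Is All You Need'; nothing is
    learned. Returns a (num_positions, dim) tensor (dim assumed even)."""
    position = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even channels
    pe[:, 1::2] = torch.cos(position * div_term)  # odd channels
    return pe
```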

lzrobots commented 3 years ago

Yes, as I clearly mentioned above:

we *learn* a specific position embedding

QiushiYang commented 3 years ago

I have the same question. Does "learn a specific position embedding" mean using one linear (FC) layer to map the original absolute coordinate inputs (of size (L, 2)) to size (L, C)? Also, why use addition instead of concatenation to fuse the position embedding and the image embedding?

lzrobots commented 3 years ago

No. There is no (L, 2) input. We directly learn position embeddings of size (L, C).
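To make the distinction concrete, here is a sketch of the two options under discussion (all names and sizes are illustrative); the second is what "directly learn" means:

```python
import torch
import torch.nn as nn

L, C = 256, 1024  # number of patches and hidden size (illustrative)

# What the question describes: derive the embedding from absolute coordinates.
coords = torch.cartesian_prod(torch.arange(16), torch.arange(16)).float()  # (L, 2)
pos_from_coords = nn.Linear(2, C)(coords)  # (L, C), a function of (row, col)

# What the answer describes: the (L, C) table itself is the learned parameter.
pos_embed = nn.Parameter(torch.zeros(1, L, C))  # no coordinate input, no mapping
```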

QiushiYang commented 3 years ago

Thanks for your help. Do you mean the position embeddings are also mapped from the original image patches, (L, 256) -> (L, 1024), just like the patch embeddings, rather than from the original pixel coordinates? Would it be better to map the position embedding from the absolute coordinates (L, 2)?

lzrobots commented 3 years ago

There is no mapping in the position embedding. We directly learn the position embedding with size (L, C). You might need to take a quick tutorial on the transformer.

675492062 commented 1 year ago

> There is no mapping in the position embedding. We directly learn the position embedding with size (L, C). You might need to take a quick tutorial on the transformer.

I have a question: the original image is divided into patches and the position encoding is assigned per patch, but what about positions inside each patch? Won't this cause problems for dense prediction tasks like semantic segmentation?