SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022.
MIT License

About the receptive field of image pixel #90

Closed · money6651626 closed 1 year ago

money6651626 commented 1 year ago

Thank you for this good work. I'm a little confused about a pixel's receptive field. I have read your paper and the answers to the issues carefully. Can I understand the effect of DiNAT as a dilated self-attention, analogous to dilated convolution (of course, you also mentioned that they are different)? You also alternate NAT and DiNAT in order to grow the receptive field while avoiding the gridding effect.

If my understanding is correct, can I calculate the receptive field of the central pixel under DiNAT with a method similar to the dilated-convolution receptive field? Assume the feature map is 64×64 (max dilation = floor(64/7)), patch_size (kernel_size) = 7, and the dilation list is [1, 4, 1, 8]. For the center pixel, the receptive field would be:

- block_1: 7
- block_2: 25
- block_3: 32
- block_4: 81

Computed this way, the dilation sequence pushes the receptive field beyond the existing feature map (of course this is for the center pixel; for an edge pixel such a receptive field can reach the other edge). My question is how this overflow is handled in NATTEN (see the sketch below for the recursion I have in mind).
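For reference, a minimal sketch of the dilated-convolution-style recursion I have in mind (the helper function and the rule RF_l = RF_{l-1} + (kernel_size - 1) * dilation_l are my own assumptions, not anything from NATTEN, so my hand-computed numbers above may not match it exactly):

```python
def nominal_receptive_field(kernel_size, dilations):
    """Dilated-convolution-style receptive field recursion (an assumption,
    not NATTEN's actual behavior): each block adds (k - 1) * dilation."""
    rf = 1
    per_block = []
    for d in dilations:
        rf += (kernel_size - 1) * d
        per_block.append(rf)
    return per_block

# kernel_size = 7, dilation list [1, 4, 1, 8], 64x64 feature map
print(nominal_receptive_field(7, [1, 4, 1, 8]))  # nominal RF exceeds 64 by the last block
```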

alihassanijr commented 1 year ago

Thanks for your interest.

The receptive field is always upper-bounded by the number of pixels in the feature map, so such an overflow never occurs.

If you're asking why neighborhood attention can be dilated more than convolution (by which I mean discrete cross-correlation, which deep learning refers to as convolution, and not the convolution operator), it comes down to how corner cases are handled in neighborhood attention: centering the window on the query token is best effort, not guaranteed.
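As a rough 1-D illustration of what "best effort" means here (a sketch of the described behavior, not the actual NATTEN kernel; it assumes each dilation group has at least kernel_size tokens):

```python
def neighborhood_window_1d(i, length, kernel_size, dilation=1):
    """Sketch: keys attended to by query i in a 1-D sequence of `length` tokens.
    Keys come from the query's dilation group (same residue mod dilation);
    the window is shifted, never padded, so every key is a real pixel."""
    group = list(range(i % dilation, length, dilation))
    pos = group.index(i)
    # Best-effort centering: clamp the window start so it stays inside the group.
    start = max(0, min(pos - kernel_size // 2, len(group) - kernel_size))
    return group[start:start + kernel_size]

# length 64, kernel 7, dilation 8: the window always stays within the 64 pixels
print(neighborhood_window_1d(0, 64, 7, dilation=8))   # edge query: window shifted right
print(neighborhood_window_1d(32, 64, 7, dilation=8))  # interior query: roughly centered
```

Because the window is shifted instead of padded, the union of attended positions can never exceed the input itself, which is why the receptive field is capped at the number of pixels.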

money6651626 commented 1 year ago

[screenshot: attention cost (FLOPs and memory) comparison tables from the NAT and DiNAT papers]

Hello! I have noticed a discrepancy in how the costs of the different attention mechanisms are written in the NAT and DiNAT papers. Both papers use "k" for the window size / neighborhood size (which I assume are the same thing), but there seems to be a squared difference between the FLOP and memory expressions in the two papers. Is there an issue with my understanding?

alihassanijr commented 1 year ago

Thank you for bringing this to our attention.

NAT's FLOP notation was 2-dimensional; DiNAT's is 1-dimensional. Hence the n symbol for the number of tokens (n = H × W).

DiNAT's notation was changed to 1-D to make it easier to illustrate receptive field growth without needing two dimensions each for space, attention weights, and the receptive field.
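To spell out the notational difference, here is a rough sketch (the specific values and constant factors are illustrative, not copied from either paper's tables): in 2-D notation, k is the side of a k × k neighborhood, so a k² appears per pixel; in 1-D notation, n is the total token count and k is the total neighborhood size, so no square appears.

```python
# Sketch: why the same attention-weight cost looks "squared" in one notation and not the other.
H, W, d = 56, 56, 32          # example feature map size and channel dim (illustrative values)
k_side = 7                    # 2-D notation: the neighborhood is k_side x k_side
n = H * W                     # 1-D notation: total number of tokens
k_total = k_side * k_side     # 1-D notation: total neighborhood size

flops_2d = H * W * d * k_side ** 2   # 2-D style: per-pixel cost carries k^2
flops_1d = n * d * k_total           # 1-D style: per-token cost carries k (no square)
assert flops_2d == flops_1d          # same count, different bookkeeping
```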

money6651626 commented 1 year ago

Thank you for your answer!