Closed · stella-von closed this 10 months ago

Thank you for your excellent work.

I have a question about whether `FocusedLinearAttention` (in `flatten_swin.py`) uses a mask, and why.

Hi @stella-von, thank you for your attention to our work. In our Flatten-Swin model, attention masks are omitted: Flatten-Swin adopts self-attention with a global receptive field, which eliminates the need for window partitioning and attention masks.
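For reference, a minimal sketch of what mask-free linear attention with a global receptive field looks like. This is an illustration, not the repository's exact `FocusedLinearAttention` code: the ReLU kernel feature map is a simple stand-in for the focused mapping used in the paper.

```python
import torch

def linear_attention(q, k, v):
    """Minimal sketch of mask-free linear attention over all tokens.

    q, k, v: (batch, num_tokens, dim). No window partitioning and no
    attention mask; every query attends to every key.
    """
    # Non-negative kernel feature map (illustrative stand-in).
    q = torch.relu(q)
    k = torch.relu(k)

    # Aggregate keys and values first: O(N * d^2) instead of O(N^2 * d).
    kv = torch.einsum("b j c, b j d -> b c d", k, v)             # (B, d, d)
    z = 1.0 / (torch.einsum("b i c, b c -> b i", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("b i c, b c d, b i -> b i d", q, kv, z)  # (B, N, d)

# Usage: x = torch.randn(2, 196, 64); out = linear_attention(x, x, x)
```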
Hi @tian-qing001, thank you for your reply.
Because the downstream task I am working on uses high-resolution images, global attention may not be suitable. If window partitioning is needed, should the mask be added after `kv = torch.einsum("b j c, b j d -> b c d", k, v)`?
Hi, @stella-von. Implementing attention masks in low-rank linear attention is challenging, as discussed in many previous works. When using window partitioning, you can simply discard the attention mask, as we did in Table 6 of the paper. Note that our module has linear complexity, so the computation cost stays the same whether you use global attention directly or window attention.
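To illustrate why the mask can be discarded, here is a hypothetical sketch of window partitioning combined with the mask-free linear attention above. `window_linear_attention`, the ReLU feature map, and the divisibility assumption are all for illustration only; windows are folded into the batch dimension, so they never interact and no mask is required.

```python
import torch

def window_linear_attention(x_q, x_k, x_v, window_size):
    """Hypothetical sketch: mask-free linear attention inside each window.

    x_q, x_k, x_v: (batch, num_tokens, dim); num_tokens is assumed to be
    divisible by window_size for simplicity.
    """
    b, n, c = x_q.shape
    w = window_size
    # Fold windows into the batch dimension: (B * n_windows, w, dim).
    # Windows are independent batch entries, so no attention mask is needed.
    q = torch.relu(x_q).reshape(b * n // w, w, c)
    k = torch.relu(x_k).reshape(b * n // w, w, c)
    v = x_v.reshape(b * n // w, w, c)

    kv = torch.einsum("b j c, b j d -> b c d", k, v)
    z = 1.0 / (torch.einsum("b i c, b c -> b i", q, k.sum(dim=1)) + 1e-6)
    out = torch.einsum("b i c, b c d, b i -> b i d", q, kv, z)
    return out.reshape(b, n, c)
```

Because the per-window computation is the same linear-attention contraction, the total cost stays linear in the number of tokens either way, which matches the point about consistent computation cost above.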