Closed · stella-von closed this 10 months ago

Thank you for your excellent work.

I have a question about whether `FocusedLinearAttention` (in `flatten_swin.py`) uses a mask, and why.

Hi @stella-von, thank you for your attention to our work. In our Flatten-Swin model, attention masks are omitted: Flatten-Swin adopts self-attention with a global receptive field, which eliminates the need for window partitioning and attention masks.
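For reference, a minimal sketch of what mask-free linear attention with a global receptive field looks like. This is an illustration, not the repository's exact `FocusedLinearAttention` code: the ReLU kernel feature map is a simple stand-in for the focused mapping used in the paper.

```python
import torch

def linear_attention(q, k, v):
    """Minimal sketch of mask-free linear attention over all tokens.

    q, k, v: (batch, num_tokens, dim). No window partitioning and no
    attention mask; every query attends to every key.
    """
    # Non-negative kernel feature map (illustrative stand-in).
    q = torch.relu(q)
    k = torch.relu(k)

    # Aggregate keys and values first: O(N * d^2) instead of O(N^2 * d).
    kv = torch.einsum("b j c, b j d -> b c d", k, v)             # (B, d, d)
    z = 1.0 / (torch.einsum("b i c, b c -> b i", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("b i c, b c d, b i -> b i d", q, kv, z)  # (B, N, d)

# Usage: x = torch.randn(2, 196, 64); out = linear_attention(x, x, x)
```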
Hi @tian-qing001, thank you for your reply.
Because the downstream task I am working on uses high-resolution images, global attention may not be suitable. If window partitioning is needed, should the mask be added after `kv = torch.einsum("b j c, b j d -> b c d", k, v)`?
Hi, @stella-von. Implementing attention masks in low-rank linear attention is challenging, as discussed in many previous works. When using window partitioning, you can simply discard the attention mask, as we did in Table 6 of the paper. Note that our module has linear complexity, so the computation cost stays the same whether you use global attention directly or window attention.
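To illustrate why the mask can be discarded, here is a hypothetical sketch of window partitioning combined with the mask-free linear attention above. `window_linear_attention`, the ReLU feature map, and the divisibility assumption are all for illustration only; windows are folded into the batch dimension, so they never interact and no mask is required.

```python
import torch

def window_linear_attention(x_q, x_k, x_v, window_size):
    """Hypothetical sketch: mask-free linear attention inside each window.

    x_q, x_k, x_v: (batch, num_tokens, dim); num_tokens is assumed to be
    divisible by window_size for simplicity.
    """
    b, n, c = x_q.shape
    w = window_size
    # Fold windows into the batch dimension: (B * n_windows, w, dim).
    # Windows are independent batch entries, so no attention mask is needed.
    q = torch.relu(x_q).reshape(b * n // w, w, c)
    k = torch.relu(x_k).reshape(b * n // w, w, c)
    v = x_v.reshape(b * n // w, w, c)

    kv = torch.einsum("b j c, b j d -> b c d", k, v)
    z = 1.0 / (torch.einsum("b i c, b c -> b i", q, k.sum(dim=1)) + 1e-6)
    out = torch.einsum("b i c, b c d, b i -> b i d", q, kv, z)
    return out.reshape(b, n, c)
```

Because the per-window computation is the same linear-attention contraction, the total cost stays linear in the number of tokens either way, which matches the point about consistent computation cost above.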