SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022
MIT License

Some comparisons against Deformable Attention #99

Closed. haiphamcse closed this issue 4 months ago.

haiphamcse commented 7 months ago

Hi there, thank you for the wonderful work on NATTEN! I'd like to ask what advantages Neighborhood Attention has over other mechanisms such as the Deformable Attention used in DETR. Thank you!

alihassanijr commented 4 months ago

Thank you for your interest, and apologies for responding to this so late.

I can't speak to how Neighborhood Attention would compare against deformable attention in terms of model accuracy without running experiments. There are clear potential advantages to making your attention mask learnable, and I would guess those apply to deformable attention as well, but I'm personally skeptical of that, because that's essentially what self attention already does: it learns projections that are likely to generate attention maps in which queries attend to "what matters". But again, this is all just high-level speculation.

As for performance, I'm all but certain that deformable conv/attention suffers the same fate as most non-deterministic methods do when compared against highly deterministic ones.

I hope that answers your question, but if not, feel free to keep this issue open.

haiphamcse commented 4 months ago

Hi there, thank you for replying. We applied Neighborhood Attention to a 3D segmentation task and observed that it outperforms Deformable Attention in semi-supervised settings (training with only 1/5/10% labeled data). This is exciting; we're investigating why, and will have more information after a successful submission ;) I do want to ask more about the drawbacks of non-deterministic attention compared to deterministic attention; as far as I know, there isn't a paper on this subject.

alihassanijr commented 4 months ago

I'm very glad to hear it!

> I do want to ask more about the drawbacks of non-deterministic attention compared to deterministic attention; as far as I know, there isn't a paper on this subject.

This was more just an observation of mine. Generally, the less predictable a computation is, the harder it is to performance-optimize. In the case of deformable attention, one issue is that it is, like (unfused) neighborhood attention, a matrix-vector multiplication, which is typically a memory-bound problem; tricks like kernel fusion will likely not help performance, so these methods don't end up utilizing the compute power modern GPUs provide.
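To make the matrix-vector point concrete, here's a toy PyTorch sketch of the core of a deformable-style layer. This is heavily simplified: real implementations predict continuous offsets and bilinearly sample, and the `index` / `weights` tensors below are just random stand-ins for those learned projections.

```python
# Toy sketch (not the Deformable DETR implementation): each query gathers a
# small, query-dependent set of value vectors and combines them with predicted
# weights. The per-query gather + weighted sum amounts to a batch of tiny
# matrix-vector products with a data-dependent access pattern.
import torch

B, L, D, K = 2, 64, 32, 4          # batch, tokens, head dim, samples per query

values  = torch.randn(B, L, D)     # value tokens
# Query-dependent sample indices and weights; in deformable attention these
# come from linear projections of the queries (random stand-ins here).
index   = torch.randint(0, L, (B, L, K))               # which tokens each query samples
weights = torch.softmax(torch.randn(B, L, K), dim=-1)  # per-sample attention weights

# Gather the K sampled value vectors for every query: (B, L, K, D)
batch_idx = torch.arange(B)[:, None, None]             # (B, 1, 1), broadcasts with index
gathered  = values[batch_idx, index]

# Weighted sum over the K samples -> one output vector per query: (B, L, D)
out = (weights.unsqueeze(-1) * gathered).sum(dim=2)
```

Each query only touches K value vectors at locations that aren't known until runtime, so the arithmetic intensity is low and the access pattern is irregular, which is what makes this hard to fuse or map onto tensor cores.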

However, because of the predictability of neighborhood attention (i.e. spatially proximate queries overlap greatly in their KV neighborhoods), we can actually model it as a matrix-matrix multiplication plus a masking operation, and arguably come up with less memory-bound solutions, i.e. fused kernels. This is exactly why Fused Neighborhood Attention was possible, and why it can actually reduce runtime when you reduce your attention window size.
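As a rough illustration of the "matrix-matrix multiplication + masking" view, here is a minimal pure-PyTorch 1D reference. This is not how the NATTEN kernels are written, and NATTEN's definition also shifts the window at the borders so every query keeps a full neighborhood, whereas this band simply clips; it's only meant to show that the computation reduces to dense matmuls with a static, structured mask.

```python
# Minimal 1D neighborhood-attention reference expressed as full QK^T matmuls
# plus a banded mask. Because the mask is static and structured, a real kernel
# can tile the matmuls and skip masked-out blocks entirely, i.e. fuse the op.
import torch

B, L, D, window = 2, 64, 32, 7     # window = neighborhood (kernel) size, odd

q = torch.randn(B, L, D)
k = torch.randn(B, L, D)
v = torch.randn(B, L, D)

# Banded mask: query i attends to keys within window // 2 positions on each side.
idx  = torch.arange(L)
mask = (idx[None, :] - idx[:, None]).abs() <= window // 2   # (L, L) boolean

attn = (q @ k.transpose(-2, -1)) / D**0.5                   # dense matmul: (B, L, L)
attn = attn.masked_fill(~mask, float("-inf"))
out  = torch.softmax(attn, dim=-1) @ v                      # second dense matmul: (B, L, D)
```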

That said, even in the case of NA, always getting a reduced runtime just from shrinking the window size is a bit unrealistic, for a number of reasons that we discuss in Faster Neighborhood Attention.

Because of all this, the best ways I can think of to accelerate/perf-optimize deformable attention are to just run masked self attention, or to keep doing the matrix-vector operation, which simply can't occupy as many resources on your GPU and will therefore be bottlenecked in performance.
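For example (and this is just my reading of the "masked self attention" option, not an official recipe), you could scatter each query's sampled key indices into a dense boolean mask and hand it to a stock attention kernel. You pay for the full QK^T, but at least it runs as matmuls; whether it actually dispatches to a fused backend depends on how well that backend supports arbitrary masks.

```python
# Sketch: turn a deformable-style, query-dependent sampling pattern into a
# dense boolean mask and run it through standard scaled dot-product attention.
import torch
import torch.nn.functional as F

B, H, L, D, K = 2, 4, 64, 32, 4    # batch, heads, tokens, head dim, samples per query

q = torch.randn(B, H, L, D)
k = torch.randn(B, H, L, D)
v = torch.randn(B, H, L, D)

# Hypothetical query-dependent sample indices (stand-in for predicted offsets).
index = torch.randint(0, L, (B, H, L, K))

# Scatter them into a dense boolean mask: True = this query may attend to this key.
attn_mask = torch.zeros(B, H, L, L, dtype=torch.bool)
attn_mask.scatter_(-1, index, torch.ones_like(index, dtype=torch.bool))

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)   # (B, H, L, D)
```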

But of course, these are all just opinions based on my experience. I could be wrong, and would be happy if anyone corrected me on this 🙂 .

haiphamcse commented 4 months ago

Thank you for the discussion. I'm not really familiar with the implementation details of attention, so I'll take your word for it ;) I'll close the issue now; hope you guys can produce more cool attention mechanisms for the community!