SHI-Labs / NATTEN

Neighborhood Attention Extension. Bringing attention to a neighborhood near you!
https://shi-labs.com/natten/

Inverted register attention #131

Open Xynonners opened 4 months ago

Xynonners commented 4 months ago

Repost from #82 since it was closed and probably didn't get seen:

If I'm understanding this correctly, the current additional_keys/values allow NA to attend to extra tokens outside of the original self-attn q/k/v.

I've been thinking, though: with architectures like FIT (https://arxiv.org/abs/2305.12689, a grouped perceiver architecture), would it be possible to invert this, i.e. have registers attend to multiple different neighborhoods?

alihassanijr commented 4 months ago

That might be complicated, because how would one define exactly what neighborhood each register will attend to? Could you clarify that a bit more?

The additional KV feature lets every query attend to some additional key-value pairs.

It is possible to have register queries attend to everything, though. That is just cross attention, and NATTEN ops would be agnostic to it, so you can do it in a separate branch.
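For readers following along, the two cases described above can be sketched in plain Python. This is a reference illustration only, not NATTEN's implementation or API: the function names (`na1d_with_extra_kv`, `register_cross_attention`), the 1D single-head layout, and the border handling (window shifted to stay in bounds) are all assumptions made for the example.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def neighborhood_indices(i, n, kernel_size):
    """Window of kernel_size tokens around i, shifted at the borders (assumed)."""
    start = min(max(i - kernel_size // 2, 0), n - kernel_size)
    return list(range(start, start + kernel_size))

def na1d_with_extra_kv(q, k, v, kernel_size, extra_k, extra_v):
    """Every query sees its own neighborhood plus the same extra key-value pairs."""
    n = len(q)
    out = []
    for i in range(n):
        idx = neighborhood_indices(i, n, kernel_size)
        keys = [k[j] for j in idx] + extra_k
        vals = [v[j] for j in idx] + extra_v
        out.append(attend(q[i], keys, vals))
    return out

def register_cross_attention(reg_q, k, v):
    """The 'inverted' direction: each register query attends to all tokens."""
    return [attend(rq, k, v) for rq in reg_q]
```

In this sketch each of the n token queries scores `kernel_size + len(extra_k)` keys, while each register query in the cross-attention branch scores all n keys. The second function has no neighborhood structure at all, which is why it can live in a separate branch.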

Xynonners commented 4 months ago

Yeah sorry, I think I got confused for a sec.

IIUC, the additional KV is essentially equivalent to global attention between the input and the registers, so an inversion would essentially be equivalent to global cross attention (i.e. it would not save any compute?).

Looking at it a bit closer, I believe the paper I linked has a specific register for each local group, and those registers then attend to each other globally (which is how it addresses the quadratic complexity).

Thanks for the response.
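To put rough numbers on the compute question raised in this thread, here is a back-of-the-envelope count of query-key score evaluations for each layout. It is a sketch under stated assumptions: the function names are hypothetical, the counts ignore heads and feature dimension, and the grouped layout follows the description given in the thread (one register per local group, registers attending to each other globally) rather than a verified reading of the paper.

```python
def global_self_attention_scores(n):
    # Every token attends to every token.
    return n * n

def neighborhood_with_extra_kv_scores(n, kernel_size, num_registers):
    # Every token attends to its neighborhood plus the shared extra KV tokens.
    return n * (kernel_size + num_registers)

def register_cross_attention_scores(n, num_registers):
    # The "inverted" case: every register attends to every token.
    return num_registers * n

def grouped_register_scores(n, group_size, registers_per_group=1):
    # Assumed layout: registers read only their own group, then attend to each other.
    num_groups = n // group_size
    num_registers = num_groups * registers_per_group
    local = num_registers * group_size
    register_to_register = num_registers * num_registers
    return local + register_to_register

n = 4096
print(global_self_attention_scores(n))               # 16777216
print(neighborhood_with_extra_kv_scores(n, 7, 64))   # 290816
print(register_cross_attention_scores(n, 64))        # 262144
print(grouped_register_scores(n, 64))                # 8192
```

Under these assumptions, the register cross attention costs about the same as the additional-KV direction, while the grouped layout is cheaper because each register only reads its own group. Note that the register-to-register term still grows quadratically in the number of groups.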