Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

[feature request] Attention Sink support? #603

Open · arendu opened this issue 10 months ago

arendu commented 10 months ago

Hi Flash-Attention Team! Are there any plans to support Attention Sink style (https://arxiv.org/pdf/2309.17453v1.pdf) attention maps for causal language models? TIA!
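For context, the recipe in that paper boils down to keeping a handful of initial "sink" tokens plus a sliding window of recent tokens in the KV cache and evicting everything in between. A minimal sketch of that eviction policy (the function name and the 4/1024 sizes here are illustrative, not anything from flash-attention's API):

```python
# Sketch of a StreamingLLM-style KV-cache eviction policy: keep a few initial
# "sink" tokens plus a sliding window of recent tokens, drop the middle.
import torch

def evict_kv(k_cache: torch.Tensor, v_cache: torch.Tensor,
             num_sink: int = 4, window: int = 1024):
    """k_cache / v_cache: (batch, seqlen, nheads, headdim)."""
    seqlen = k_cache.shape[1]
    if seqlen <= num_sink + window:
        return k_cache, v_cache
    keep = torch.cat([
        torch.arange(num_sink, device=k_cache.device),             # attention-sink tokens
        torch.arange(seqlen - window, seqlen, device=k_cache.device),  # recent window
    ])
    return k_cache[:, keep], v_cache[:, keep]
```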

tridao commented 10 months ago

I haven't looked too closely, but they already provide an implementation linked from their paper.

arendu commented 10 months ago

Yes, the paper does have a code link, but IIUC it uses native PyTorch attention rather than anything FA-2 based.

tridao commented 10 months ago

How's the speed there? For training or inference? How much do you think it can be improved?

arendu commented 10 months ago

I will run some comparisons with FA-2's sliding window attention and let you know.
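For anyone else benchmarking this: FA-2 exposes local attention through the `window_size` argument of `flash_attn_func` (added around flash-attn 2.3, if I remember right), which covers the sliding-window half of attention sink but not the sink tokens themselves. A rough sketch of the call, with arbitrary sizes:

```python
# Causal sliding-window attention with FA-2. Assumes flash-attn >= 2.3 (where
# window_size was added) and a CUDA device with fp16 support; sizes are arbitrary.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim, window = 1, 8192, 32, 128, 1024
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       device="cuda", dtype=torch.float16) for _ in range(3))

# Each query attends to at most `window` keys to its left and none to its right.
out = flash_attn_func(q, k, v, causal=True, window_size=(window, 0))
```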

lucidrains commented 10 months ago

there's actually some potential follow up research to attention sink that only Tri's library will be able to support (after a few pull requests). mainly i believe that for learnable sinks, one needs to be able to leave the queries and keys unrotated for queries of the main sequence attending to the sink keys. there's no way to do this efficiently unless it were fused.
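To make that concrete, here is an unfused sketch of the idea in plain PyTorch: rotary embeddings are applied to the main sequence, while scores against the learnable sink keys use the unrotated queries. `attend_with_unrotated_sinks` and `apply_rope` are hypothetical names (any standard RoPE helper would do), not flash-attention API; the point is that you end up needing both rotated and unrotated queries and a fully materialized score matrix, which is exactly what a fused kernel would avoid.

```python
# Unfused sketch: rotate q/k for the main sequence, but score the learnable
# sink keys against the *unrotated* queries so the sinks carry no positional phase.
import torch

def attend_with_unrotated_sinks(q, k, v, sink_k, sink_v, apply_rope):
    """q/k/v: (batch, seqlen, nheads, headdim); sink_k/sink_v: (nheads, n_sink, headdim)."""
    b, s, h, d = q.shape
    q_rot, k_rot = apply_rope(q), apply_rope(k)            # rotated main sequence
    q_, k_, v_ = (t.transpose(1, 2) for t in (q_rot, k_rot, v))  # (b, h, s, d)
    # Scores within the main sequence use rotated q/k, with a causal mask.
    scores_seq = q_ @ k_.transpose(-1, -2) / d**0.5
    scores_seq = scores_seq + torch.full((s, s), float("-inf"), device=q.device).triu(1)
    # Scores against the sink keys use the unrotated queries.
    scores_sink = q.transpose(1, 2) @ sink_k.transpose(-1, -2) / d**0.5
    attn = torch.softmax(torch.cat([scores_sink, scores_seq], dim=-1), dim=-1)
    n_sink = sink_k.shape[-2]
    out = attn[..., :n_sink] @ sink_v + attn[..., n_sink:] @ v_
    return out.transpose(1, 2)                             # back to (b, s, h, d)
```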

lucidrains commented 10 months ago

anyways, the concept of sinks, if it becomes important, will require a rethinking of how to approach relative positions. or maybe some phd student will surprise us all with a clever solution 🤞

sk-g commented 6 months ago

> there's actually some follow up research to attention sink that only Tri's library will be able to support (after a few pull requests). mainly i believe that for learnable sinks, one needs to be able to leave the queries and keys unrotated for queries of the main sequence attending to the sink keys. there's no way to do this efficiently unless it were fused.

Hi @lucidrains, could you please share some follow-ups?

Very recently this paper came out and seems promising: https://github.com/FlagOpen/FlagEmbedding/tree/master/Long_LLM/activation_beacon

lucidrains commented 6 months ago

@sk-g i haven't been following this line of research for a while

what did you want an update on specifically?

sk-g commented 6 months ago

@lucidrains Ah, I was just curious about what work you were referring to when you said this:

> there's actually some follow up research to attention sink

lucidrains commented 6 months ago

@sk-g ah, follow up research in my mind haha

edited it for less confusion