Closed: imMid-Star closed this issue 10 months ago

Hello, thanks for your great work! For vanilla self-attention and transposed self-attention, the complexities are O(n²·d) and O(n·d²). Could you please tell me how to compute the complexity of the efficient additive attention? Thanks in advance!

Hi @imMid-Star,
The complexity of the efficient additive attention is O(n·d).

I also want to ask whether the reduction in complexity comes mainly from cancelling the operation between the key and value?

No, the reduction in complexity comes from avoiding the expensive dot product in computing the attention maps. Cancelling the key-value operation only improves speed, without sacrificing accuracy.
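To make the counting concrete, here is a minimal NumPy sketch of an efficient-additive-attention-style token mixer. The names `Wq`, `Wk`, `wa` and the final residual are illustrative assumptions, not the exact module from the paper: the point is that every step touches each of the n tokens with O(d) work and no n×n attention map is ever formed, so the mixing costs O(n·d) instead of the O(n²·d) of a dot-product attention map.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def efficient_additive_attention(x, Wq, Wk, wa):
    """Token mixing in O(n*d): no n x n attention map is formed.

    x:      (n, d) input tokens
    Wq, Wk: (d, d) query/key projections (the projections themselves cost
            O(n*d^2), the same price standard attention pays for its
            linear projections)
    wa:     (d,) learnable attention vector (hypothetical name)
    """
    Q = x @ Wq                                   # (n, d)
    K = x @ Wk                                   # (n, d)
    d = Q.shape[1]
    # Per-token scores from a query/vector dot product: O(n*d)
    alpha = softmax(Q @ wa / np.sqrt(d))         # (n,)
    # Pool the queries into one global query: O(n*d)
    q_global = (alpha[:, None] * Q).sum(axis=0)  # (d,)
    # Element-wise interaction of the global query with each key: O(n*d)
    return q_global[None, :] * K + Q             # (n, d)

# Usage: 8 tokens of width 16
rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
out = efficient_additive_attention(
    x,
    rng.standard_normal((d, d)),
    rng.standard_normal((d, d)),
    rng.standard_normal(d),
)
print(out.shape)
```

Compare this with vanilla attention, whose `softmax(Q @ K.T)` step materializes an (n, n) matrix at O(n²·d) cost; here the only interaction across tokens is the O(n·d) weighted pooling into `q_global`.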