Closed: imMid-Star closed this issue 10 months ago

Hello, thanks for your great work! For vanilla self-attention and transposed self-attention, the complexities are O(n²·d) and O(n·d²). Could you please tell me how to compute the complexity of the efficient additive attention? Thanks in advance!

Hi @imMid-Star,
The complexity of the efficient additive attention is O(n·d).

I also want to ask whether the reduction in complexity comes mainly from cancelling the operation between the key and value?

No, the reduction in complexity comes from avoiding the expensive dot product in computing the attention maps. Cancelling the key-value operation only improves speed, without sacrificing accuracy.
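To make the counting concrete, here is a minimal NumPy sketch of an efficient-additive-attention-style token mixer. The names `Wq`, `Wk`, `wa` and the final residual are illustrative assumptions, not the exact module from the paper: the point is that every step touches each of the n tokens with O(d) work and no n×n attention map is ever formed, so the mixing costs O(n·d) instead of the O(n²·d) of a dot-product attention map.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def efficient_additive_attention(x, Wq, Wk, wa):
    """Token mixing in O(n*d): no n x n attention map is formed.

    x:      (n, d) input tokens
    Wq, Wk: (d, d) query/key projections (the projections themselves cost
            O(n*d^2), the same price standard attention pays for its
            linear projections)
    wa:     (d,) learnable attention vector (hypothetical name)
    """
    Q = x @ Wq                                   # (n, d)
    K = x @ Wk                                   # (n, d)
    d = Q.shape[1]
    # Per-token scores from a query/vector dot product: O(n*d)
    alpha = softmax(Q @ wa / np.sqrt(d))         # (n,)
    # Pool the queries into one global query: O(n*d)
    q_global = (alpha[:, None] * Q).sum(axis=0)  # (d,)
    # Element-wise interaction of the global query with each key: O(n*d)
    return q_global[None, :] * K + Q             # (n, d)

# Usage: 8 tokens of width 16
rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
out = efficient_additive_attention(
    x,
    rng.standard_normal((d, d)),
    rng.standard_normal((d, d)),
    rng.standard_normal(d),
)
print(out.shape)
```

Compare this with vanilla attention, whose `softmax(Q @ K.T)` step materializes an (n, n) matrix at O(n²·d) cost; here the only interaction across tokens is the O(n·d) weighted pooling into `q_global`.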