munael opened this issue 4 years ago
Apologies for the delayed reply.
You are correct that dot product attention requires N by N dot products to compute the attention matrix.
The efficiency claim for the SHA-RNN's attention is along the lines of Shazeer's One Write-Head is All You Need: since the keys and values do not require a matrix multiplication, and only the queries do, there are substantial computational savings. That's why I note the vector-vector operation.
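For concreteness, here is a minimal sketch (not the repo's exact code) of the saving being described: only the query goes through a learned matrix multiplication, while the keys and values reuse the stored hidden states directly. The class and tensor names are illustrative.

```python
import torch
import torch.nn as nn

class SingleHeadAttentionSketch(nn.Module):
    """Illustrative only: query-only projection, keys/values taken as-is."""
    def __init__(self, d_model):
        super().__init__()
        # The only learned projection. For a single new timestep this is a
        # vector-matrix product rather than three full Q/K/V matrix multiplies.
        self.q_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, h, memory):
        # h:      (batch, N, d) current hidden states
        # memory: (batch, M, d) stored hidden states reused as keys and values
        q = self.q_proj(h)                        # only transform applied
        k = v = memory                            # no K/V matrix multiplications
        scores = torch.matmul(q, k.transpose(-2, -1)) / k.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)      # still an N-by-M score matrix
        return torch.matmul(attn, v)
```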
For reducing the N by N attention component itself, you would indeed need to look towards other potential solutions (approximate attention, sparse attention, ...).
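To make the cost split explicit, a rough accounting (assuming sequence length $N$ and model dimension $d$) would look like:

```latex
\begin{align*}
\text{Standard Q/K/V projections:} &\quad O(3 N d^2) \\
\text{Query-only projection (SHA-RNN style):} &\quad O(N d^2) \\
\text{Attention score matrix } QK^\top: &\quad O(N^2 d) \quad \text{(present in both cases)}
\end{align*}
```

So removing the K/V transforms shrinks the $N d^2$ projection term, while the $N^2 d$ score computation is only addressed by the approximate or sparse approaches mentioned above.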
In Figure 1 there's a claim that the attention module is "highly efficient". This is explained by the removal of the K/V transforms. Then, for the attention scores block, it is said:
This seems misleading, as the overall complexity of the A block is still dominated by a large N^2 matrix-matrix product. This is usually the highest-complexity part of the classical attention module.
Can you clarify :D ?