We make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length.
We show that this is achieved by changing the attention from the traditional softmax attention to a feature map based dot-product attention.
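To make the associativity trick concrete, here is a minimal, non-causal sketch of feature-map attention in PyTorch. The function names (`feature_map`, `linear_attention`) and the plain 2-D tensor shapes are my own illustrative choices, not the paper's or repo's API; the point is that $\phi(Q)\left(\phi(K)^\top V\right)$ never materializes the $N \times N$ attention matrix.

```python
import torch

def feature_map(x):
    # Placeholder positive feature map phi; the paper's concrete choice
    # (ELU(x) + 1) is quoted further below.
    return torch.nn.functional.elu(x) + 1

def linear_attention(Q, K, V):
    """Non-causal linear attention, O(N) in sequence length.

    Q, K: (N, d), V: (N, d_v). Instead of softmax(Q K^T) V, compute
    phi(Q) (phi(K)^T V) using associativity of the matrix product.
    """
    Qp, Kp = feature_map(Q), feature_map(K)                 # (N, d)
    KV = Kp.transpose(0, 1) @ V                             # (d, d_v), summed over positions
    Z = Qp @ Kp.sum(dim=0, keepdim=True).transpose(0, 1)    # (N, 1) normalizer
    return (Qp @ KV) / Z                                    # (N, d_v)

# Tiny usage example with random tensors
N, d, d_v = 8, 4, 4
Q, K, V = torch.randn(N, d), torch.randn(N, d), torch.randn(N, d_v)
print(linear_attention(Q, K, V).shape)  # torch.Size([8, 4])
```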
More related to our model are the works of Child et al. (2019) and Kitaev et al. (2020). The former (Child et al., 2019) introduced sparse factorizations of the attention matrix, reducing the overall complexity from quadratic to $\mathcal{O}\left(N\sqrt{N}\right)$ for generative modeling of long sequences. More recently, Kitaev et al. (2020) proposed Reformer. This method further reduces complexity to $\mathcal{O}\left(N \log N\right)$ by using locality-sensitive hashing (LSH) to perform fewer dot products.
We instead show that a self-attention layer trained with an autoregressive objective can be seen as a recurrent neural network, and that this observation can be used to significantly speed up the inference of autoregressive transformer models.
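A minimal sketch of that recurrent view, under the same assumptions as above (illustrative names, generic `feature_map` argument): the state $S_i = \sum_{j \le i} \phi(k_j) v_j^\top$ and $z_i = \sum_{j \le i} \phi(k_j)$ are updated in place, so each new token costs $\mathcal{O}(1)$ time and memory during generation.

```python
import torch

def causal_linear_attention_rnn(Q, K, V, feature_map):
    """Autoregressive (causal) linear attention written as an RNN.

    S accumulates outer products phi(k_j) v_j^T and z accumulates phi(k_j);
    position i attends only to positions j <= i.
    Q, K: (N, d), V: (N, d_v).
    """
    N, d = Q.shape
    d_v = V.shape[1]
    S = torch.zeros(d, d_v)   # running sum of outer products
    z = torch.zeros(d)        # running sum of feature-mapped keys
    outputs = []
    for i in range(N):
        q, k, v = feature_map(Q[i]), feature_map(K[i]), V[i]
        S = S + torch.outer(k, v)
        z = z + k
        outputs.append((q @ S) / (q @ z + 1e-6))  # small eps for numerical safety
    return torch.stack(outputs)                   # (N, d_v)
```

Usage would mirror the earlier sketch, e.g. `causal_linear_attention_rnn(Q, K, V, lambda x: torch.nn.functional.elu(x) + 1)`.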
We propose to use the following element-wise function:
\begin{equation}
\phi(x)=\text{ELU}(x)+1= \begin{cases}x+1, & \text { if } x>0 \\ \exp (x), & \text { if } x \leq 0\end{cases}
\end{equation}
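As a quick sanity check of the piecewise definition, `ELU(x) + 1` in PyTorch reproduces $x + 1$ for positive inputs and $\exp(x)$ otherwise, so the feature map is strictly positive:

```python
import torch

def phi(x):
    # phi(x) = ELU(x) + 1: x + 1 for x > 0, exp(x) for x <= 0 (always positive).
    return torch.nn.functional.elu(x) + 1

x = torch.tensor([-2.0, 0.0, 3.0])
print(phi(x))  # tensor([0.1353, 1.0000, 4.0000])
```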
https://arxiv.org/abs/2006.16236