We make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length.
We show that this is achieved by changing the attention from the traditional softmax attention to a feature map based dot-product attention.
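To make the associativity trick concrete, here is a minimal, non-causal sketch of feature-map attention in PyTorch. The function names (`feature_map`, `linear_attention`) and the plain 2-D tensor shapes are my own illustrative choices, not the paper's or repo's API; the point is that $\phi(Q)\left(\phi(K)^\top V\right)$ never materializes the $N \times N$ attention matrix.

```python
import torch

def feature_map(x):
    # Placeholder positive feature map phi; the paper's concrete choice
    # (ELU(x) + 1) is quoted further below.
    return torch.nn.functional.elu(x) + 1

def linear_attention(Q, K, V):
    """Non-causal linear attention, O(N) in sequence length.

    Q, K: (N, d), V: (N, d_v). Instead of softmax(Q K^T) V, compute
    phi(Q) (phi(K)^T V) using associativity of the matrix product.
    """
    Qp, Kp = feature_map(Q), feature_map(K)                 # (N, d)
    KV = Kp.transpose(0, 1) @ V                             # (d, d_v), summed over positions
    Z = Qp @ Kp.sum(dim=0, keepdim=True).transpose(0, 1)    # (N, 1) normalizer
    return (Qp @ KV) / Z                                    # (N, d_v)

# Tiny usage example with random tensors
N, d, d_v = 8, 4, 4
Q, K, V = torch.randn(N, d), torch.randn(N, d), torch.randn(N, d_v)
print(linear_attention(Q, K, V).shape)  # torch.Size([8, 4])
```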
More related to our model are the works of Child et al. (2019) and Kitaev et al. (2020). The former (Child et al., 2019) introduced sparse factorizations of the attention matrix, reducing the overall complexity from quadratic to $\mathcal{O}\left(N\sqrt{N}\right)$ for generative modeling of long sequences. More recently, Kitaev et al. (2020) proposed Reformer. This method further reduces complexity to $\mathcal{O}\left(N \log N\right)$ by using locality-sensitive hashing (LSH) to perform fewer dot products.
We instead show that a self-attention layer trained with an autoregressive objective can be seen as a recurrent neural network, and that this observation can be used to significantly speed up the inference of autoregressive transformer models.
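A minimal sketch of that recurrent view, under the same assumptions as above (illustrative names, generic `feature_map` argument): the state $S_i = \sum_{j \le i} \phi(k_j) v_j^\top$ and $z_i = \sum_{j \le i} \phi(k_j)$ are updated in place, so each new token costs $\mathcal{O}(1)$ time and memory during generation.

```python
import torch

def causal_linear_attention_rnn(Q, K, V, feature_map):
    """Autoregressive (causal) linear attention written as an RNN.

    S accumulates outer products phi(k_j) v_j^T and z accumulates phi(k_j);
    position i attends only to positions j <= i.
    Q, K: (N, d), V: (N, d_v).
    """
    N, d = Q.shape
    d_v = V.shape[1]
    S = torch.zeros(d, d_v)   # running sum of outer products
    z = torch.zeros(d)        # running sum of feature-mapped keys
    outputs = []
    for i in range(N):
        q, k, v = feature_map(Q[i]), feature_map(K[i]), V[i]
        S = S + torch.outer(k, v)
        z = z + k
        outputs.append((q @ S) / (q @ z + 1e-6))  # small eps for numerical safety
    return torch.stack(outputs)                   # (N, d_v)
```

Usage would mirror the earlier sketch, e.g. `causal_linear_attention_rnn(Q, K, V, lambda x: torch.nn.functional.elu(x) + 1)`.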
We propose to use the following element-wise function:
\begin{equation}
\phi(x)=\text{ELU}(x)+1= \begin{cases}x+1, & \text { if } x>0 \\ \exp (x), & \text { if } x \leq 0\end{cases}
\end{equation}
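As a quick sanity check of the piecewise definition, `ELU(x) + 1` in PyTorch reproduces $x + 1$ for positive inputs and $\exp(x)$ otherwise, so the feature map is strictly positive:

```python
import torch

def phi(x):
    # phi(x) = ELU(x) + 1: x + 1 for x > 0, exp(x) for x <= 0 (always positive).
    return torch.nn.functional.elu(x) + 1

x = torch.tensor([-2.0, 0.0, 3.0])
print(phi(x))  # tensor([0.1353, 1.0000, 4.0000])
```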
https://arxiv.org/abs/2006.16236