
Linear Transformers Are Secretly Fast Weight Programmers #30

Open 5g4s opened 1 year ago

5g4s commented 1 year ago

https://arxiv.org/abs/2102.11174

5g4s commented 1 year ago

We infer a memory capacity limitation of recent linearised softmax attention variants and replace the purely additive outer-product update with a delta rule-like programming instruction, so that the Fast Weight Programmer can correct the value currently associated with a key.

5g4s commented 1 year ago

The general idea of fast weights is to make the weights also variable and input-dependent.

5g4s commented 1 year ago

Context-dependent FWPs were introduced in two-network systems of the early 1990s.

$$
\begin{aligned}
\boldsymbol{a}^{(i)}, \boldsymbol{b}^{(i)} & =\boldsymbol{W}_a \boldsymbol{x}^{(i)}, \boldsymbol{W}_b \boldsymbol{x}^{(i)} \\
\boldsymbol{W}^{(i)} & =\sigma\left(\boldsymbol{W}^{(i-1)}+\boldsymbol{a}^{(i)} \otimes \boldsymbol{b}^{(i)}\right) \\
\boldsymbol{y}^{(i)} & =\boldsymbol{W}^{(i)} \boldsymbol{x}^{(i)}
\end{aligned}
$$

where $\otimes$ denotes the outer product, $\sigma$ is an activation function, $\boldsymbol{W}_a$ and $\boldsymbol{W}_b$ are trainable slow weights, while the fast weights $\boldsymbol{W}^{(i)}$ are generated at each time step $i$ and serve as a short-term memory.
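
A minimal NumPy sketch of this two-network update (the dimensions, the random slow weights, and the choice of $\sigma=\tanh$ are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 6, 4                             # illustrative sizes
W_a = 0.1 * rng.normal(size=(d_out, d_in))     # slow weights producing a (trained in practice)
W_b = 0.1 * rng.normal(size=(d_in, d_in))      # slow weights producing b (trained in practice)
sigma = np.tanh                                # assumed activation; the formula leaves sigma generic

W_fast = np.zeros((d_out, d_in))               # fast weights = short-term memory

for x in rng.normal(size=(5, d_in)):           # toy input sequence
    a, b = W_a @ x, W_b @ x                    # generated by the slow net from the current input
    W_fast = sigma(W_fast + np.outer(a, b))    # additive outer-product write, squashed by sigma
    y = W_fast @ x                             # the fast net maps the current input
```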

5g4s commented 1 year ago

Viewing linear Transformer variants as Fast Weight Programmers provides two insights that we investigate in this work: their capacity limits as associative memories (Sec. 4.1), and their inability to edit previously stored associations (Sec. 4.2).
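
To make the capacity point concrete, here is a small illustration (my own, not code from the paper) of a purely additive outer-product memory $\boldsymbol{W}=\sum_i \boldsymbol{v}^{(i)} \otimes \phi(\boldsymbol{k}^{(i)})$: reading out $\boldsymbol{W}\phi(\boldsymbol{k}^{(j)})$ returns $\boldsymbol{v}^{(j)}$ plus cross-talk from all other keys, and once more pairs are stored than the key dimension allows to be mutually orthogonal, that cross-talk necessarily corrupts retrieval.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                                     # key/value dimension (illustrative)

def retrieval_error(num_pairs):
    """Store num_pairs associations additively, then read every key back."""
    keys = rng.normal(size=(num_pairs, d))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)    # unit-norm keys
    values = rng.normal(size=(num_pairs, d))
    W = sum(np.outer(v, k) for k, v in zip(keys, values))  # purely additive writes
    recalled = keys @ W.T                                  # row j is W @ keys[j]
    return np.mean(np.linalg.norm(recalled - values, axis=1))

for n in (4, 16, 64):
    print(n, retrieval_error(n))  # cross-talk grows with n and becomes severe once n exceeds d
```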

5g4s commented 1 year ago

$$
\begin{aligned}
\boldsymbol{k}^{(i)}, \boldsymbol{v}^{(i)}, \boldsymbol{q}^{(i)} & =\boldsymbol{W}_k \boldsymbol{x}^{(i)}, \boldsymbol{W}_v \boldsymbol{x}^{(i)}, \boldsymbol{W}_q \boldsymbol{x}^{(i)} \\
\overline{\boldsymbol{v}}^{(i)} & =\boldsymbol{W}^{(i-1)} \phi\left(\boldsymbol{k}^{(i)}\right) \\
\beta^{(i)} & =\sigma\left(\boldsymbol{W}_\beta \boldsymbol{x}^{(i)}\right) \\
\boldsymbol{v}_{\text {new }}^{(i)} & =\beta^{(i)} \boldsymbol{v}^{(i)}+\left(1-\beta^{(i)}\right) \overline{\boldsymbol{v}}^{(i)}
\end{aligned}
$$

$\beta^{(i)}$ determines the extent to which the new value replaces the previous one. Note that while $\beta^{(i)}$ only depends on $\boldsymbol{x}^{(i)}$, in a multi-layer model $\boldsymbol{x}^{(i)}$ carries the full context information except in the first layer.

The fast weight update and the final output $\boldsymbol{y}^{(i)}$ are defined as follows.

$$
\begin{aligned}
\boldsymbol{W}^{(i)} & =\boldsymbol{W}^{(i-1)} \underbrace{+\boldsymbol{v}_{\text {new }}^{(i)} \otimes \phi\left(\boldsymbol{k}^{(i)}\right)}_{\text {write }} \underbrace{-\overline{\boldsymbol{v}}^{(i)} \otimes \phi\left(\boldsymbol{k}^{(i)}\right)}_{\text {remove }} \\
& =\boldsymbol{W}^{(i-1)}+\beta^{(i)}\left(\boldsymbol{v}^{(i)}-\overline{\boldsymbol{v}}^{(i)}\right) \otimes \phi\left(\boldsymbol{k}^{(i)}\right) \\
\boldsymbol{y}^{(i)} & =\boldsymbol{W}^{(i)} \phi\left(\boldsymbol{q}^{(i)}\right)
\end{aligned}
$$
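
A minimal NumPy sketch of one step of this delta-rule update (dimensions and random slow weights are placeholders; I take $\sigma$ to be a sigmoid for the scalar $\beta^{(i)}$ and use ELU+1 as a simple stand-in for $\phi$, whereas the paper also studies other feature maps):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_key = 8, 8                                  # illustrative sizes

# Slow (trainable) projections; random here for illustration.
W_k = 0.1 * rng.normal(size=(d_key, d_model))
W_v = 0.1 * rng.normal(size=(d_model, d_model))
W_q = 0.1 * rng.normal(size=(d_key, d_model))
W_beta = 0.1 * rng.normal(size=(d_model,))

def phi(x):
    """Feature map; ELU(x) + 1 keeps features positive (an assumed choice)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W_fast = np.zeros((d_model, d_key))                    # fast weight matrix (short-term memory)

for x in rng.normal(size=(5, d_model)):                # toy input sequence
    k, v, q = W_k @ x, W_v @ x, W_q @ x
    v_bar = W_fast @ phi(k)                            # value currently stored for this key
    beta = sigmoid(W_beta @ x)                         # scalar write strength in (0, 1)
    v_new = beta * v + (1.0 - beta) * v_bar            # interpolate new and stored value
    W_fast = W_fast + np.outer(v_new - v_bar, phi(k))  # write v_new, remove v_bar
    y = W_fast @ phi(q)                                # output via the updated fast weights
```

Dropping the remove term and fixing $\beta^{(i)}=1$ recovers the purely additive update of the original linear Transformer.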