ChrisFuscoMasters / TransformerLib

Facilitates my research

Research Self-Attention #9

Open CJcool06 opened 1 year ago

CJcool06 commented 1 year ago

Aim

Find out what self-attention actually does (i.e. benefits, limitations) and what research is already out there.

Plan

CJcool06 commented 1 year ago

Paper: Low-Rank and Locality Constrained Self-Attention for Sequence Modelling

Comments: They create low-rank self-attention by restricting the encoding dimension to r. They then apply band attention using the result from the low-rank self-attention together with the original input. It's not clear to me exactly how they did this; it would be worth looking into further. I'm skeptical of some of the conclusions and results. The maths and matrix decomposition were very nice and worth coming back to.
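For my own notes, a rough sketch of the two ingredients as I understand them (all names are mine, and since I'm not sure how the paper combines them, the two pieces are kept separate here): self-attention with the query/key encodings restricted to a small dimension r, and band (local) attention that masks out positions beyond a fixed width.

```python
import torch

def low_rank_self_attention(x, W_q, W_k, W_v):
    """Self-attention where the query/key encodings are restricted to rank r (r << d)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v                      # q, k: (n, r); v: (n, d)
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v                 # (n, d)

def band_attention(x, W_q, W_k, W_v, width):
    """Self-attention masked to a band: position i only attends to j with |i - j| <= width."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    idx = torch.arange(x.shape[0])
    mask = (idx[None, :] - idx[:, None]).abs() > width       # True = blocked
    return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v

n, d, r = 16, 64, 8                                          # r is the restricted encoding dim
x = torch.randn(n, d)
low_rank_out = low_rank_self_attention(x, torch.randn(d, r), torch.randn(d, r), torch.randn(d, d))
local_out = band_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d), width=2)
```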

CJcool06 commented 1 year ago

Paper: Neural Machine Translation by Jointly Learning to Align and Translate

Comments: This is the earliest paper I've found that resembles the attention used in Transformers today. Given an RNN, each forward encoder hidden state outputs an "annotation" that contains the encoder hidden states up to and including the current one, concatenated with the hidden states computed by a backward encoder block. These annotations (len(annotations) == sequence_length) are then used by the decoder, which weights them with a learned "alignment model". This model uses the previous decoder state to generate a new set of softmaxed probabilities for each input-annotation pair. They call this "soft-alignment" (as opposed to hard-alignment) as it allows 'skipping' words and then tracking back to them, and also allows the source and target sequences to have different lengths. All of this resembles the attention that we know today.

The intuition for concatenating the hidden states from both the forward and backward RNNs is that we can use information from before AND after a word to decide the next word.

It's worth noting that they have essentially made each hidden state output a context vector, instead of the entire encoder block outputting a single context vector at the end. The intuition behind this is that it frees the model from having to encode a whole source sentence into a fixed-length vector, and also lets the model focus only on information relevant to the generation of the next target word.

It's also worth noting a limitation -- RNN encoders are biased to words that are close together in the sequence. This implicit locality is not reflected in a Transformer.
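To pin down the mechanism for myself, here is a minimal sketch of the soft-alignment step described above (module and variable names are mine, and the RNN details are left out): the alignment model scores each annotation against the previous decoder state, the scores are softmaxed over the source positions, and the context vector for the current decoding step is the weighted sum of the annotations.

```python
import torch
import torch.nn as nn

class AdditiveAlignment(nn.Module):
    """Alignment model: scores each annotation h_j against the previous decoder
    state s_{i-1}, softmaxes over source positions, and returns a context vector."""
    def __init__(self, dec_dim, ann_dim, hidden_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, hidden_dim, bias=False)
        self.W_h = nn.Linear(ann_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, s_prev, annotations):
        # s_prev: (dec_dim,), annotations: (src_len, ann_dim)
        scores = self.v(torch.tanh(self.W_s(s_prev) + self.W_h(annotations)))  # (src_len, 1)
        weights = torch.softmax(scores.squeeze(-1), dim=0)                     # (src_len,)
        context = weights @ annotations                                        # (ann_dim,)
        return context, weights

src_len, enc_dim, dec_dim = 10, 128, 256
ann_dim = 2 * enc_dim                          # forward and backward hidden states concatenated
align = AdditiveAlignment(dec_dim, ann_dim, hidden_dim=128)
context, weights = align(torch.randn(dec_dim), torch.randn(src_len, ann_dim))
```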

This was a great paper to read and brought me closer to the aim of this project.

CJcool06 commented 1 year ago

Paper: Long Short-Term Memory-Networks for Machine Reading

Comments: This paper introduced me to the concept of intra-attention, that is, attention within the encoder/decoder block. This is different from inter-attention, which allows the decoder to attend to the encoder outputs. It's still not clear to me what the purpose of intra-attention is with respect to a Transformer. In this paper, they use intra-attention to choose which parts of memory to focus on.

The paper focuses on RNNs with LSTM cells. Their contributions are attention-based memory addressing and a method that does not require separate attention training (i.e. the attention is trained jointly with the rest of the network). To my understanding, the attention is parameterised.

Note that there are some loose similarities to how a Transformer calculates attention. Might be worth coming back to.
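To keep the distinction straight, a generic sketch of intra- vs inter-attention using plain scaled dot-product scoring (this is not the paper's LSTM-based formulation, just my shorthand for the two directions of attention):

```python
import torch

def attention(queries, keys, values):
    """Generic scaled dot-product attention (not the paper's LSTM-based scoring)."""
    scores = queries @ keys.transpose(-1, -2) / (keys.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ values

d = 64
encoder_states = torch.randn(10, d)   # source-side representations
decoder_states = torch.randn(7, d)    # target-side representations

# Intra-attention: a sequence attends over itself (what a Transformer calls self-attention).
intra = attention(encoder_states, encoder_states, encoder_states)   # (10, d)

# Inter-attention: the decoder attends over the encoder outputs (encoder-decoder attention).
inter = attention(decoder_states, encoder_states, encoder_states)   # (7, d)
```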