In 2017, Vaswani et al. published a paper titled “Attention Is All You Need” for the NeurIPS conference. They introduced the original transformer architecture for machine translation, performing better and faster than RNN encoder-decoder models, which were mainstream.
The "big" transformer model achieved state-of-the-art (SOTA) performance on machine translation benchmarks.
The transformer is simpler and more performant than the models it superseded. The author says:
However, the word "simple" is probably not how most readers feel when first looking at the architecture diagram.

Figure 1: The Transformer — model architecture
Reading the paper is inspirational but provokes many questions, and searching for the answers does not always yield a satisfactory explanation.
Detail-oriented readers might have many doubts about positional encoding; this article works through them as a series of questions.
In Section 3.5 (Positional Encoding) of the paper, the author explains why they need to encode the position of each token (a word, special character, or any other distinct unit):
Historically, researchers used RNNs (Recurrent Neural Networks) to keep track of ordinal dependencies in sequential data. Later, researchers incorporated attention mechanisms into RNN encoder-decoder architectures to better extract contextual information for each word.
In contrast, the transformer's encoder-decoder architecture uses attention mechanisms without recurrence or convolution. That's what the paper title "Attention Is All You Need" means.
However, without positional information, an attention-only model might believe the following two sentences have the same semantics:
That’d be a bad thing for machine translation models. So, yes, we need to encode word positions (note: I’m using ‘token’ and ‘word’ interchangeably).
The question is: how?
Vaswani et al. explain where they add positional encoding (PE):
Let's look at where the inputs feed into the encoder. The positional encoding is applied after the input word embedding and before the encoder.
The author explains further:
The base transformer uses word embeddings of 512 dimensions (elements). Therefore, the positional encoding also has 512 elements, so we can sum a word embedding vector and a positional encoding vector by element-wise addition.
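As a quick illustration (a minimal sketch with made-up vectors, not code from the paper), the two vectors simply add element by element:

```python
import numpy as np

d_model = 512
rng = np.random.default_rng(0)

word_embedding = rng.standard_normal(d_model)       # embedding of one token (illustrative values)
positional_encoding = rng.standard_normal(d_model)  # encoding of its position (illustrative values)

encoder_input = word_embedding + positional_encoding  # element-wise sum, still 512 elements
assert encoder_input.shape == (d_model,)
```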
With a concatenation of the two vectors, the model would see the positional encoding independently of the word embedding, which might make learning (optimization) easier.
However, there are some drawbacks. Concatenation increases the memory footprint, for one.
How about reducing the number of elements in positional encoding? Concatenation does not require two vectors to have equal dimensions. So, we can adjust the size of the encoding vector just big enough to support positional encoding, can’t we? One problem with this approach is that the number of dimensions in the positional encoding will become another hyper-parameter to tune.
Granted that we can think of various alternative approaches to positional encoding, the truth is they pushed positional encoding vectors into the same space as the word embedding vectors, and it worked.
The model can learn to use the positional information without getting confused with the embedding (semantic) information. It’s hard to imagine what’s happening inside the network, yet it works.
Moreover, element-wise addition keeps the vector size (and hence the memory footprint) unchanged, and we don't need an additional hyper-parameter. So, let's accept this as a viable method and continue reading the paper.
How do we calculate positional encoding? The author explains it as follows:
$$
\begin{aligned}
PE_{(pos,2i)} &= \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right) \\
PE_{(pos,2i+1)} &= \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right),
\end{aligned}
$$
where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ (i.e., $\sin(pos)$, where $i = 0$) to approximately $10000 \cdot 2\pi$ (i.e., $\sin(\frac{pos}{10000})$, where $\frac{2i}{d_{model}}$ approaches 1); they grow geometrically, not uniformly.
The above formulas look complex. So, let's expand them, assuming the model uses $d_{model}=512$ dimensions for positional encoding (and word embedding).
We represent each token as a vector of 512 dimensions (from element 0 to element 511).
A position (pos) is an integer from 0 to a pre-defined maximum number of tokens in a sentence (a.k.a. max length) minus 1. For example, if the max length is 128, positions are 0 to 127.
For every pair of elements, i.e. $\sin(\cdot)$ and $\cos(\cdot)$, we calculate the angle (in radians) as follows:
$$ \frac{pos}{10000^{\frac{2i}{512}}}. $$
Since we have 512 dimensions, we have 256 pairs of sine and cosine values. As such, the value of i goes from 0 to 255.
Considering everything, we calculate each element of the positional encoding vector for a position (pos) as follows:
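Concretely, with $d_{model}=512$, the vector for position $pos$ interleaves the sine and cosine pairs like this (an expanded sketch of the formulas above):

$$
PE_{pos} = \left[ \sin\left(\tfrac{pos}{10000^{0/512}}\right),\ \cos\left(\tfrac{pos}{10000^{0/512}}\right),\ \sin\left(\tfrac{pos}{10000^{2/512}}\right),\ \cos\left(\tfrac{pos}{10000^{2/512}}\right),\ \ldots,\ \sin\left(\tfrac{pos}{10000^{510/512}}\right),\ \cos\left(\tfrac{pos}{10000^{510/512}}\right) \right]
$$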
So, the encoding for word position 0 is a vector of alternately repeating 0s and 1s, since $\sin(0)=0$ and $\cos(0)=1$.
Position 1 looks different: its first pair is $\sin(1) \approx 0.8415$ and $\cos(1) \approx 0.5403$, and the angles of later pairs shrink toward zero.
We can see that the sine and cosine pair at the end is close to 0 and 1, which makes sense: the denominator inside the sine and cosine functions becomes so large that the angle is practically zero. For example, at position 1 the last pair's angle is $\frac{1}{10000^{510/512}} \approx 0.0001$.
Looking at raw numbers doesn't tell us much, though. Let's visualize them and see if we can find any patterns.
We can write a short Python script to generate all the positional encoding values:
```python
import math
import numpy as np

MAX_SEQ_LEN = 128  # maximum length of a sentence
d_model = 512      # word embedding (and positional encoding) dimensions

# pre-allocate the encoding matrix with zeros
PE = np.zeros((MAX_SEQ_LEN, d_model))

# for each position, and for each sine/cosine pair of dimensions
for pos in range(MAX_SEQ_LEN):
    for i in range(d_model // 2):
        theta = pos / (10000 ** ((2 * i) / d_model))
        PE[pos, 2 * i] = math.sin(theta)
        PE[pos, 2 * i + 1] = math.cos(theta)
```
Instead of using for-loops, we can take advantage of NumPy's vectorized operations (inspired by the PyTorch tutorial):
```python
# replaces the above for-loop
pos = np.arange(MAX_SEQ_LEN)[:, np.newaxis]
div_term = np.exp(np.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
PE[:, 0::2] = np.sin(pos * div_term)
PE[:, 1::2] = np.cos(pos * div_term)
```
`np.arange(0, d_model, 2)` generates the even integers from 0 to d_model - 2. Below is an example where d_model = 10:
```python
>>> np.arange(0, 10, 2)
array([0, 2, 4, 6, 8])
```
So, for each value of 2i produced by the np.arange term, we are calculating:
$$
\begin{aligned}
\exp\left(\frac{2i \times (-\log 10000)}{d_{model}}\right) &= \exp\left(\log 10000^{-\frac{2i}{d_{model}}}\right) \\
&= 10000^{-\frac{2i}{d_{model}}} \\
&= \frac{1}{10000^{\frac{2i}{d_{model}}}}
\end{aligned}
$$
We use this as the division term (the reciprocal of the denominator) in the positional encoding formula:
$$
\begin{aligned}
PE_{(pos,2i)} &= \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right) \\
PE_{(pos,2i+1)} &= \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)
\end{aligned}
$$
Both versions produce the same values. I like the one with for-loops better because it's easy to understand, and it runs fast enough to experiment with a small MAX_SEQ_LEN. For actual implementations, it'd be better to use the faster version. In either case, we only need to generate the positional encoding values once.
Let’s draw the positional encoding values as a 2D image:
```python
import matplotlib.pyplot as plt

im = plt.imshow(PE, aspect='auto')
plt.title("Positional Encoding")
plt.xlabel("Encoding Dimension")
plt.ylabel("Position Index")
plt.colorbar(im)
plt.show()
```
The top row is position 0, and the bottom row is position 127, since I used a maximum sentence length of 128. The x-axis shows the vector dimension index from 0 to 511.
The values are between -1 and 1 since they are from sine and cosine functions. Darker colors indicate values closer to -1, and brighter (yellow) colors are closer to 1. Green colors indicate values in between. For example, dark green colors indicate values around 0.
As expected, the values towards the right (the end of the vector) seem to alternate between 0 (dark green) and 1 (yellow). But it's hard to see because the pixels are so fine.
So, I created a smaller version of the encoding matrix with the $max\ length=16$ and the $d_{model}=32$.
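Here is a sketch (the same formula as above, just with smaller constants) that reproduces the smaller image:

```python
import math
import numpy as np
import matplotlib.pyplot as plt

max_len, d_small = 16, 32  # smaller max length and model dimensions
PE_small = np.zeros((max_len, d_small))
for pos in range(max_len):
    for i in range(d_small // 2):
        theta = pos / (10000 ** ((2 * i) / d_small))
        PE_small[pos, 2 * i] = math.sin(theta)
        PE_small[pos, 2 * i + 1] = math.cos(theta)

im = plt.imshow(PE_small, aspect='auto')
plt.title("Positional Encoding (max length = 16, d_model = 32)")
plt.xlabel("Encoding Dimension")
plt.ylabel("Position Index")
plt.colorbar(im)
plt.show()
```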
Encoding vectors with bigger position indices have fewer alternating 0s and 1s at the end. Each row appears to have a unique pattern, so I would imagine that the model could somehow distinguish between them. We'll see more details of the positional encoding calculation later on.
Next, let’s interpret what positional encoding represents.
To build more intuition about positional encoding, let's look at it from a different perspective. As we know, positional encoding consists of pairs of sine and cosine functions. If we take any pair and plot its values for different positions on a 2D plane, the points all lie on a unit circle.
When we increase the value pos, the point moves clockwise. It’s like a clock hand that moves as word position increases. In a positional encoding vector with 512 dimensions, we have 256 hands.
Looking at the first two pairs in the positional encoding, the second hand moves more slowly than the first because the denominator grows as the dimension index increases.
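We can see this picture with a small sketch (my own code, not from the paper; I plot the fast pair i=0 against a much later, slower pair so the speed difference is visible):

```python
import numpy as np
import matplotlib.pyplot as plt

d_model = 512
positions = np.arange(8)  # the first few word positions

# pair i=0 (fast hand) versus a later pair (slow hand) for visible contrast
for i in (0, 64):
    omega = 1.0 / (10000 ** ((2 * i) / d_model))
    plt.scatter(np.sin(positions * omega), np.cos(positions * omega), label=f"pair i={i}")

plt.gca().set_aspect("equal")
plt.legend()
plt.title("Each sine-cosine pair stays on the unit circle")
plt.show()
```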
So, we can interpret each sine-cosine pair as a clock hand rotating at its own speed. Towards the end of a positional encoding vector, the hands advance less and less per step of the position index (pos).
The hands near the end of the dimensions are slow because the denominator is large; their angles stay approximately zero unless the position index is large enough.
We can confirm that by looking at the positional encoding image again. Below is the smaller version of the positional encoding image.
So, we can think of a positional encoding vector as a snapshot of 256 clock hands, each rotating at its own speed as the position advances.
Now, let’s try interpreting the combination of positional encoding and word embedding. If we take the first two dimensions of a word embedding vector, we can draw it in a 2D plane.
Adding the first two dimensions of a positional encoding to it moves the point onto a unit circle centered at the word embedding.
So, the points on the unit circle represent the same word at different positions: instead of a single point, we get a distribution of points! For higher dimensions, it's hard to imagine this visually, so we won't go into that.
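In two dimensions, though, a small sketch makes the picture clear (the "embedding" values are made up; only the sine-cosine pair comes from the encoding formula):

```python
import numpy as np
import matplotlib.pyplot as plt

word_xy = np.array([2.0, 1.5])   # hypothetical first two embedding dimensions of one word
positions = np.arange(8)
pe_xy = np.stack([np.sin(positions), np.cos(positions)], axis=1)  # pair i=0
points = word_xy + pe_xy         # the same word shifted by its positional encoding

plt.scatter(word_xy[0], word_xy[1], marker="x", label="word embedding")
plt.scatter(points[:, 0], points[:, 1], label="embedding + PE (pos 0..7)")
plt.gca().set_aspect("equal")
plt.legend()
plt.show()
```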
However, there is one question: if word embedding vectors and positional encoding vectors have similar magnitudes, the model could get confused about which is which. So, how does the model distinguish between word embeddings and positional encodings?
The word embedding layer has learnable parameters, so the optimization process can adjust the magnitudes of word embedding vectors if necessary. I'm not saying that is indeed happening, but the model is capable of it. It is amazing that the model can learn to separate positional information from semantic information within a single summed vector. The power of optimization is incredible.
We can identify each word position uniquely using positional encoding. But how does the model know the relative positions of words in a sentence?
The author explains their thought process:
We can rotate each hand of a clock to represent another time. In other words, we can transform one word position into another using a rotation matrix, which is a linear operation. Hence, a neural network can learn to understand relative word positions. As the author mentions, it was a hypothesis, but it sounds quite reasonable.
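To make this concrete, write $\omega_i = \frac{1}{10000^{2i/d_{model}}}$ for the angular frequency of pair $i$ (a derivation from the encoding formulas, spelled out here for clarity). The trigonometric addition formulas then give, for any offset $k$:

$$
\begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i pos) \\ \cos(\omega_i pos) \end{pmatrix}
$$

So $PE_{pos+k}$ is a linear function (a rotation) of $PE_{pos}$ that depends only on the offset $k$, not on $pos$ itself.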
The author further explains why they chose sine and cosine functions:
If a model can relate word positions to one another by rotations, it should be able to handle any relative offset, and it may even extrapolate to sequence lengths longer than those seen during training.
So, these are the reasons why they chose sine and cosine functions. They also tested the transformer with learned positional embeddings, which encode positional information using a layer with learnable parameters, and found that the two versions produced nearly identical results. So, they went for the simpler (fixed) encoding method.
The encoder in the transformer takes input word embeddings containing positional encodings and applies further operations to them. So, why doesn't the model lose the word positions in the process?
The transformer circumvents this problem with residual connections:
The positional encodings carry over through multiple layers in the encoder and decoder.
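Below is a minimal sketch (hypothetical names, not the transformer's actual layers, which also include layer normalization) of how a residual connection lets positional information pass through a sub-layer:

```python
import numpy as np

def residual_sublayer(x, sublayer_fn):
    # x carries word embedding + positional encoding; adding it back to the
    # sub-layer output lets the positional information flow to the next layer.
    return x + sublayer_fn(x)

# usage sketch: a dummy sub-layer standing in for attention or the feed-forward block
x = np.random.randn(128, 512)                     # (positions, d_model), illustrative values
out = residual_sublayer(x, lambda h: np.tanh(h))  # the output still contains x, and with it the encoding
```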
Naoki. Transformer’s Positional Encoding: How Does It Know Word Positions Without Recurrence?