NorbertZheng / read-papers

My paper reading notes.
MIT License

Naoki | Transformer’s Positional Encoding: How Does It Know Word Positions Without Recurrence? #63

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Naoki. Transformer’s Positional Encoding: How Does It Know Word Positions Without Recurrence?

NorbertZheng commented 1 year ago

Overview

In 2017, Vaswani et al. published a paper titled “Attention Is All You Need” at the NeurIPS conference. They introduced the original transformer architecture for machine translation, which performed better and faster than the RNN encoder-decoder models that were mainstream at the time.

The big transformer model achieved SOTA (state-of-the-art) performance on machine translation benchmarks.

The transformer is both more performant and more straightforward than the models it superseded; the authors describe it as a simple architecture based solely on attention mechanisms.

However, the word “simple” is probably not how most readers feel when first looking at the architecture diagram.

Figure 1: The Transformer — model architecture

Reading the paper is inspirational but provokes many questions, and searching for the answers does not always yield a satisfactory explanation.

Detail-oriented readers might have many doubts about positional encoding, which we discuss in this article through the following questions:

- Why positional encoding?
- Why add positional encoding to word embeddings?
- How to encode word positions?
- How to visualize positional encoding?
- How to interpret positional encoding?
- Why sine and cosine functions?
- How do positions survive through multiple layers?

NorbertZheng commented 1 year ago

Why Positional Encoding?

In section 3.5 (Positional Encoding) of the paper, the authors explain why they need to encode the position of each token (word, special character, or whatever distinct unit): since the model contains no recurrence and no convolution, some information about the relative or absolute position of the tokens must be injected into the input.

Historically, researchers used RNNs (Recurrent Neural Networks) to keep track of ordinal dependencies in sequential data. Later, researchers incorporated attention mechanisms into RNN encoder-decoder architectures to better extract contextual information for each word.

In contrast, the transformer’s encoder-decoder architecture uses attention mechanisms without recurrence or convolution. That’s what the paper title “Attention Is All You Need” means.

However, without positional information, an attention-only model might believe that two sentences containing the same tokens in a different order, such as “the dog chased the cat” and “the cat chased the dog”, have the same semantics.
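To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the article) showing that plain self-attention is permutation-equivariant: reordering the input tokens merely reorders the output rows, so nothing in the result distinguishes the two word orders.

import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(4, 8))   # toy embeddings of a 4-token sentence (d_model = 8)
perm = [3, 1, 2, 0]           # the same tokens in a different order

def self_attention(X):
    # a single attention head with identity projections, just for illustration
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ X

out_original = self_attention(X)
out_permuted = self_attention(X[perm])

print(np.allclose(out_original[perm], out_permuted))  # True: only the row order differs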

That’d be a bad thing for machine translation models. So, yes, we need to encode word positions (note: I’m using ‘token’ and ‘word’ interchangeably).

The question is: how?

NorbertZheng commented 1 year ago

Why Add Positional Encoding To Word Embeddings?

Vaswani et al. explain where they put positional encoding (PE): they add positional encodings to the input embeddings at the bottoms of the encoder and decoder stacks.

Let’s look at where inputs feed into the encoder. As the architecture diagram (Figure 1) shows, the positional encoding is added after the input word embedding and before the encoder.

The authors explain further that the positional encodings have the same dimension as the embeddings, so that the two can be summed.

The base transformer uses word embeddings of 512 dimensions (elements). Therefore, the positional encoding also has 512 elements, and we can sum a word embedding vector and a positional encoding vector by element-wise addition.

An obvious alternative would be to concatenate the two vectors instead. With concatenation, the model sees the positional encoding vector independently from the word embedding, which might make learning (optimization) easier.

However, there are some drawbacks: concatenation increases the vector size, and with it the memory footprint and the amount of computation in the layers that consume those vectors.

How about reducing the number of elements in the positional encoding? Concatenation does not require the two vectors to have equal dimensions, so we could make the encoding vector just big enough to encode positions, couldn’t we? One problem with this approach is that the number of dimensions in the positional encoding becomes another hyper-parameter to tune.
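Here is a tiny NumPy sketch (my own illustration; the 64-dimensional encoding size is a made-up value) of the shapes involved in the two options:

import numpy as np

d_model, d_pos = 512, 64                   # d_pos is hypothetical, only for the concatenation variant
word_emb = np.random.randn(d_model)        # embedding of one token
pos_enc_same = np.random.randn(d_model)    # stand-in positional encoding of the same size
pos_enc_small = np.random.randn(d_pos)     # stand-in smaller positional encoding

added = word_emb + pos_enc_same                           # shape stays (512,)
concatenated = np.concatenate([word_emb, pos_enc_small])  # shape grows to (576,)
print(added.shape, concatenated.shape)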

Granted, we can think of various alternative approaches to positional encoding, but the fact is that the authors pushed positional encoding vectors into the same space as the word embedding vectors, and it worked.

The model can learn to use the positional information without getting confused with the embedding (semantic) information. It’s hard to imagine what’s happening inside the network, yet it works.

Moreover, addition keeps the memory footprint and the vector dimensions unchanged.

Also, we don’t need an additional hyper-parameter. So, let’s accept this as a viable method and continue reading the paper.

NorbertZheng commented 1 year ago

How To Encode Word Positions?

How do we calculate positional encoding? The author explains it as follows:

$$ \begin{aligned} PE_{(pos,2i)}&=\sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)\\ PE_{(pos,2i+1)}&=\cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right), \end{aligned} $$

3.5 Positional Encoding

where $pos$ is the position and $i$ is the dimension index. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric (not uniform) progression from $2\pi$ (i.e., $\sin(pos)$ when $i=0$) to roughly $10000 \cdot 2\pi$ (i.e., approximately $\sin(\frac{pos}{10000})$ when $2i$ approaches $d_{model}$).

The above formulas look complex, so let’s expand them, assuming the model uses $d_{model}=512$ dimensions for positional encoding (and word embedding).

We represent each token as a vector of 512 dimensions (from element 0 to element 511).

A position (pos) is an integer from 0 to a pre-defined maximum number of tokens in a sentence (a.k.a. max length) minus 1. For example, if the max length is 128, positions are 0 to 127.

For every pair of elements, i.e. $\sin(\cdot)$ and $\cos(\cdot)$, we calculate the angle (in radians) as follows:

$$ \frac{pos}{10000^{\frac{2i}{512}}}. $$

Since we have 512 dimensions, we have 256 pairs of sine and cosine values. As such, the value of i goes from 0 to 255.

Considering everything, we calculate each element of the positional encoding vector for a position ($pos$) as follows:

$$ PE_{pos}=\left[\sin\left(\frac{pos}{10000^{\frac{0}{512}}}\right), \cos\left(\frac{pos}{10000^{\frac{0}{512}}}\right), \sin\left(\frac{pos}{10000^{\frac{2}{512}}}\right), \cos\left(\frac{pos}{10000^{\frac{2}{512}}}\right), \dots, \sin\left(\frac{pos}{10000^{\frac{510}{512}}}\right), \cos\left(\frac{pos}{10000^{\frac{510}{512}}}\right)\right] $$

So, the encoding for word position 0 is $[\sin(0), \cos(0), \sin(0), \cos(0), \dots] = [0, 1, 0, 1, \dots, 0, 1]$: a vector of alternately repeating 0s and 1s.

Position 1 starts with $\sin(1) \approx 0.8414$ and $\cos(1) \approx 0.5403$, and its final sine/cosine pair is very close to 0 and 1. (Note: I truncated element values to the fourth decimal place.)

We can see that the sine and cosine pair at the end is close to 0 and 1, which makes sense: the denominator $10000^{\frac{2i}{512}}$ becomes very large toward the end of the vector, so the angle is practically zero.
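A quick numerical check of this (my own snippet):

import math

# the last sine/cosine pair (i = 255) of the encoding for position 1
i, pos, d_model = 255, 1, 512
theta = pos / (10000 ** ((2 * i) / d_model))
print(theta)            # ~0.0001037: the angle is almost zero
print(math.sin(theta))  # ~0.0001
print(math.cos(theta))  # ~1.0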

Looking at raw numbers doesn’t tell us much, so let’s visualize them and see if we can find any patterns.

NorbertZheng commented 1 year ago

How To Visualize Positional Encoding?

We can write a short Python script to generate all the positional encoding values:

import math
import numpy as np

MAX_SEQ_LEN = 128 # maximum length of a sentence
d_model = 512 # word embedding (and positional encoding) dimensions

# pre-allocates vectors with zeros
PE = np.zeros((MAX_SEQ_LEN, d_model))

# for each position, and for each dimension
for pos in range(MAX_SEQ_LEN):
    for i in range(d_model//2):
        theta = pos / (10000 ** ((2*i)/d_model))
        PE[pos, 2*i ] = math.sin(theta)
        PE[pos, 2*i + 1] = math.cos(theta)

Instead of using for-loops, we can take advantage of NumPy’s vectorized operations (inspired by the PyTorch tutorial):

# replace the above for loop
pos = np.arange(MAX_SEQ_LEN)[:, np.newaxis]
div_term = np.exp(np.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))

PE[:, 0::2] = np.sin(pos * div_term)
PE[:, 1::2] = np.cos(pos * div_term)

np.arange(0, d_model, 2) generates the even integers from 0 to d_model - 2. Below is an example where d_model = 10.

>>> np.arange(0, 10, 2)
array([0, 2, 4, 6, 8])

So, for each value of 2i from the np.arange term, we are calculating:

$$ \begin{aligned} \exp\left(\frac{2i \times (-\log 10000)}{d_{model}}\right) &= \exp\left(\log 10000^{-\frac{2i}{d_{model}}}\right)\\ &= 10000^{-\frac{2i}{d_{model}}}\\ &= \frac{1}{10000^{\frac{2i}{d_{model}}}} \end{aligned} $$
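We can verify this identity numerically (my own check):

import math
import numpy as np

d_model = 512
two_i = np.arange(0, d_model, 2)
via_exp = np.exp(two_i * (-math.log(10000.0) / d_model))
direct = 1.0 / (10000 ** (two_i / d_model))
print(np.allclose(via_exp, direct))  # True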

We use this as the division term (div_term in the code above) in the positional encoding formula:

$$ \begin{aligned} PE_{(pos,2i)}&=\sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)\\ PE_{(pos,2i+1)}&=\cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right) \end{aligned} $$

In other words, pos * div_term gives exactly the angles we computed with the for-loop.

Both versions produce the same values (see the quick check below). I like the one with for-loops better because it’s easier to understand, and it runs fast enough to experiment with a small MAX_SEQ_LEN. For actual implementations, it’d be better to use the faster version. In either case, we only need to generate the positional encoding values once.
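A quick way to confirm the two versions agree (my own check; it assumes the loop result is still in PE and rebuilds the vectorized version into a separate array, here called PE_vec):

PE_vec = np.zeros((MAX_SEQ_LEN, d_model))
pos = np.arange(MAX_SEQ_LEN)[:, np.newaxis]
div_term = np.exp(np.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
PE_vec[:, 0::2] = np.sin(pos * div_term)
PE_vec[:, 1::2] = np.cos(pos * div_term)

print(np.allclose(PE, PE_vec))  # True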

Let’s draw the positional encoding values as a 2D image:

import matplotlib.pyplot as plt

im = plt.imshow(PE, aspect='auto')

plt.title("Positional Encoding")
plt.xlabel("Encoding Dimension")
plt.ylabel("Position Index")
plt.colorbar(im)
plt.show()

The top row is position 0, and the bottom row is position 127, since I used a maximum sentence length of 128. The x-axis is the vector dimension index, from 0 to 511.

The values are between -1 and 1 since they are from sine and cosine functions. Darker colors indicate values closer to -1, and brighter (yellow) colors are closer to 1. Green colors indicate values in between. For example, dark green colors indicate values around 0.

As expected, the values towards the right (the end of the vector elements) seem to alternate between 0s (dark green) and 1s (yellow). But it’s hard to see because the pixels are so fine.

So, I created a smaller version of the encoding matrix with max length $=16$ and $d_{model}=32$.
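Here is a small snippet (my own, reusing the imports above and following the same recipe) to regenerate and plot this smaller matrix:

MAX_SEQ_LEN_SMALL, d_model_small = 16, 32

pos_small = np.arange(MAX_SEQ_LEN_SMALL)[:, np.newaxis]
div_term_small = np.exp(np.arange(0, d_model_small, 2) * (-math.log(10000.0) / d_model_small))

PE_small = np.zeros((MAX_SEQ_LEN_SMALL, d_model_small))
PE_small[:, 0::2] = np.sin(pos_small * div_term_small)
PE_small[:, 1::2] = np.cos(pos_small * div_term_small)

plt.imshow(PE_small, aspect='auto')
plt.colorbar()
plt.show()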

Encoding vectors with larger position indices have fewer alternating 0s and 1s at the end. Each row appears to have a unique pattern, so I would imagine that the model could somehow distinguish between them. We’ll see more details of the positional encoding calculation later on.

Next, let’s interpret what positional encoding represents.

NorbertZheng commented 1 year ago

How To Interpret Positional Encoding?

To have more intuition about positional encoding, let’s look at it from a different perspective. As we know, positional encoding consists of pairs of sine and cosine values. If we take any pair and plot it for different positions on a 2D plane, all the points lie on a unit circle.


When we increase the value pos, the point moves clockwise. It’s like a clock hand that moves as word position increases. In a positional encoding vector with 512 dimensions, we have 256 hands.

Looking at the first two pairs in the positional encoding, the second hand moves more slowly than the first because the denominator grows as the dimension index increases.
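A small plot (my own sketch, reusing the imports above) makes the different speeds visible; I use the smaller $d_{model}=32$ configuration from before so the gap between the first two pairs is easy to see:

# plot the first two sine/cosine pairs for positions 0..7 (d_model = 32)
for i, marker in [(0, 'o'), (1, 'x')]:
    angles = [p / (10000 ** ((2 * i) / 32)) for p in range(8)]
    plt.scatter([math.sin(a) for a in angles],
                [math.cos(a) for a in angles],
                marker=marker, label=f"pair i={i}")
plt.gca().set_aspect('equal')
plt.legend()
plt.show()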

So, we can interpret a positional encoding vector as a set of 256 clock hands, each rotating at its own speed as the word position increases.

Towards the end of a positional encoding vector, hands move slower and slower when increasing position index (pos).

The hands near the end of the dimensions move slowly because the denominator is large; the angles there are approximately zero unless the position index is large enough.

We can confirm that by looking at the positional encoding image again. In the smaller version from before, the columns toward the right barely change as the position index increases.

So, we can think of each word position as being encoded by many hands rotating at different speeds: the fast hands at the beginning of the vector distinguish nearby positions, while the slow hands toward the end change noticeably only for large position indices.

NorbertZheng commented 1 year ago

Now, let’s try interpreting the combination of positional encoding and word embedding. If we take the first two dimensions of a word embedding vector, we can draw it as a point on a 2D plane.

Adding the first two dimensions of a positional encoding to this point moves it onto a unit circle centered at the word embedding.

So, the points on that circle represent the same word at different positions: not a single point, but a distribution of points. For higher dimensions, it’s hard to visualize, so we won’t go into that.
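A toy visualization (my own sketch, reusing the imports above, with a made-up 2-D “word embedding”) of the same word at the first few positions:

word = np.array([2.0, 1.5])   # made-up 2-D word embedding
offsets = np.array([[math.sin(p), math.cos(p)] for p in range(8)])  # first PE pair (i = 0)
points = word + offsets       # the same word at positions 0..7

plt.scatter(word[0], word[1], marker='*', s=200, label='word embedding')
plt.scatter(points[:, 0], points[:, 1], label='embedding + positional encoding')
plt.gca().set_aspect('equal')
plt.legend()
plt.show()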

However, there is one question: if word embedding vectors and positional encoding vectors have similar magnitudes, couldn’t the model confuse one with the other? So, how does the model distinguish between word embeddings and positional encodings?

The word embedding layer has learnable parameters, so the optimization process can adjust the magnitudes of the word embedding vectors if necessary. I’m not saying that is what actually happens, but the model is capable of it. It is amazing that the model can learn to use the positional and semantic information even though the two are simply added together.

The power of optimization is incredible.

NorbertZheng commented 1 year ago

Why Sine and Cosine Functions?

We can identify each word position uniquely using positional encoding. But how does the model know the relative positions of words in a sentence?

The authors explain their thought process: they hypothesized that the sinusoidal form would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

We can rotate each hand of a clock to represent another time. In other words, we can map one word position to another using a rotation matrix, which is a linear operation. Hence, a neural network can learn to work with relative word positions. As the authors mention, it was a hypothesis, but it sounds quite reasonable.
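Here is a small numerical check of that claim (my own sketch): for one sine/cosine pair, rotating $PE_{pos}$ by the angle corresponding to an offset $k$ gives exactly $PE_{pos+k}$.

import math
import numpy as np

d_model = 512
i, pos, k = 3, 10, 7
w = 1.0 / (10000 ** ((2 * i) / d_model))     # angular frequency of pair i

R = np.array([[ math.cos(w * k), math.sin(w * k)],
              [-math.sin(w * k), math.cos(w * k)]])
pe_pos = np.array([math.sin(w * pos), math.cos(w * pos)])
pe_shifted = np.array([math.sin(w * (pos + k)), math.cos(w * (pos + k))])

print(np.allclose(R @ pe_pos, pe_shifted))   # True: a rotation maps position 10 to position 17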

The authors further explain why they chose sine and cosine functions over a learned encoding: the sinusoidal version may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

If a model can identify relative positions of words by rotations, it should be able to detect any relative positions.

So, these are the reasons why they chose sine and cosine functions. They also tested the transformer with positional embedding, which encodes positional information using a layer with learnable parameters, and found that it produced nearly identical results.

So, they went for the simpler (fixed) encoding method.

NorbertZheng commented 1 year ago

How Do Positions Survive through Multiple Layers?

The encoder in the transformer takes input word embeddings containing positional encodings and applies further operations (attention and feed-forward layers). So why doesn’t the model lose the word positions in the process?

The transformer circumvents this problem with residual connections: each sub-layer’s output is added to its own input.

The positional encodings carry over through multiple layers in the encoder and decoder.
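A minimal sketch (my own, with layer normalization omitted) of how a residual connection keeps the input, including its positional component, present in the output:

def sublayer_with_residual(x, f):
    # f is a sub-layer such as self-attention or a feed-forward block;
    # the input x (word embedding + positional encoding) is added back,
    # so the positional information is not lost after the transformation
    return x + f(x)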

NorbertZheng commented 1 year ago

References