Hit the star if you like the repo.
Exciting News!
The blog version of this repository is now live!
Check it out here: Read the Blog
The implementation is based on the official research paper. This repository aims to give me, and others, insight into how Transformers really work, down to the level of gradients. It implements a 1-layer Transformer architecture with no dropout and with custom optimisation and layers. This not only lets users build upon the repo but also lets them run toy experiments, as we are all GPU poor. This light architecture can be easily understood and used by the community to investigate how Transformers learn and generalize. It can also be used for experiments such as grokking on very simple tasks, like predicting addition or other operations on numbers, and it can help others better understand the Transformer.
[Use Big-Screen for better view]
If you have any questions or would like me to add something, please feel free to create an issue.
def position_embedding(self, sent: Tensor, d_model: int) -> Tensor:
    # sent: (seq_len, d_model) token embeddings; d_model is assumed to be even.
    seq_len = sent.size()[0]
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is already the even dimension index, so the exponent is i/d_model.
            pe[pos][i] = math.sin(pos/(10000**(i/d_model)))
            pe[pos][i+1] = math.cos(pos/(10000**(i/d_model)))
    # adding positional encoding to the sentence, that will be passed into the transformer (encoder/decoder).
    final_sent = sent + t.tensor(pe, dtype=sent.dtype)
    return final_sent
Figure 1 - The 128-dimensional positional encoding for a sentence with a maximum length of 50. Each row represents the embedding vector [1].
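The table behind Figure 1 can be reproduced with a short standalone sketch. It assumes the sinusoidal formulation of the original paper, where even dimensions use sine and odd dimensions use cosine. The function name `sinusoidal_table` is hypothetical and not part of the repository; the sizes (50 positions, 128 dimensions) are taken from the caption.

```python
import math

def sinusoidal_table(max_len: int, d_model: int) -> list:
    """Return a max_len x d_model table of sinusoidal positional encodings.

    Hypothetical helper for illustration; d_model is assumed to be even.
    """
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            table[pos][i + 1] = math.cos(angle)
    return table

pe = sinusoidal_table(max_len=50, d_model=128)
print(len(pe), len(pe[0]))   # 50 128
print(pe[0][:4])             # [0.0, 1.0, 0.0, 1.0]
```

Position 0 always encodes to alternating 0 and 1, and every later row is a set of sine and cosine waves whose wavelength grows with the dimension index.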
The other components of the encoder are explained in detail below.
Multiple matrices, such as $W_q$, $W_k$, and $W_v$, are used to extract content from the input embeddings, transforming them into queries, keys, and values. The use of multiple heads allows the model to capture different aspects of the context for each query. This can be likened to a student having the ability to ask several questions (multiple heads) versus only being allowed to ask a single question (one head), thus enabling a richer understanding.
The outputs of all heads are finally concatenated and passed through a linear layer, which projects them back to the dimension of a single head (512 in this implementation).
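A minimal shape-level sketch of the paragraph above, assuming a single sentence of 4 tokens, a model width of 512, and 2 heads of width 512 each (the sizes used in this README). The variable names and the helper `split_heads` are illustrative; the point is how the projected tensor is split into heads and how the heads are concatenated again before the output projection.

```python
import math
import torch as t
import torch.nn as nn

t.manual_seed(0)
seq_len, d_model, num_heads, head_dim = 4, 512, 2, 512   # assumed toy sizes

x = t.randn(seq_len, d_model)                    # stand-in for embeddings + positions
W_q = nn.Linear(d_model, num_heads * head_dim)
W_k = nn.Linear(d_model, num_heads * head_dim)
W_v = nn.Linear(d_model, num_heads * head_dim)
W_o = nn.Linear(num_heads * head_dim, d_model)

def split_heads(proj: t.Tensor) -> t.Tensor:
    # (seq_len, num_heads*head_dim) -> (num_heads, seq_len, head_dim)
    return proj.view(proj.size(0), num_heads, head_dim).transpose(0, 1)

q, k, v = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))

# Each head computes its own (seq_len, seq_len) attention pattern.
weights = t.softmax(q @ k.transpose(-2, -1) / math.sqrt(head_dim), dim=-1)
per_head = weights @ v                            # (num_heads, seq_len, head_dim)

# Concatenate the heads along the feature axis, then project back to d_model.
concat = per_head.transpose(0, 1).reshape(seq_len, num_heads * head_dim)
out = W_o(concat)

print(weights.shape, concat.shape, out.shape)
```

Splitting with `view(seq_len, num_heads, head_dim)` followed by a transpose keeps every token's features together; reshaping directly to `(num_heads, seq_len, head_dim)` would mix features from different tokens.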
def self_attention(self):
    # Project to (seq_len, num_heads*dim), then split the heads: (1, num_heads, seq_len, dim).
    query = self.W_q(self.input_embedding).view(self.seq_len, self.num_heads, self.q_dim).transpose(0, 1).unsqueeze(0) # (1, 2, 4, 512)
    key = self.W_k(self.input_embedding).view(self.seq_len, self.num_heads, self.k_dim).transpose(0, 1).unsqueeze(0) # (1, 2, 4, 512)
    value = self.W_v(self.input_embedding).view(self.seq_len, self.num_heads, self.v_dim).transpose(0, 1).unsqueeze(0) # (1, 2, 4, 512)
    # we will take the scaled dot product of query and key to get the similarity score.
    attention_score = t.softmax(t.matmul(query, key.transpose(2,3))/math.sqrt(self.k_dim), dim=-1) # (1, 2, 4, 4)
    overall_attention = t.matmul(attention_score, value) # (1, 2, 4, 512)
    # concatenate the heads along the feature dimension (t.cat cannot take a single tensor).
    overall_attention = overall_attention.transpose(1, 2).reshape(1, self.seq_len, self.v_dim*self.num_heads) # (1, 4, 1024)
    final_attention = self.W_o(overall_attention) # (1, 4, 512)
    return final_attention
Layer normalization is a crucial technique in Transformers that helps stabilize and accelerate training. It normalizes the inputs to each layer across the feature dimension, so that every token's activations have a mean of zero and a standard deviation of one before the learned scale and shift are applied. This keeps the distribution of activations stable during training, which can lead to more consistent learning.
self.layer_norm = nn.LayerNorm(512)
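As a quick check of the statement above, the following sketch normalizes each token's feature vector by hand and compares the result with `nn.LayerNorm`. It assumes the default settings of `nn.LayerNorm` (scale initialised to 1 and shift to 0); the input tensor is random toy data.

```python
import torch as t
import torch.nn as nn

t.manual_seed(0)
x = t.randn(4, 512) * 3.0 + 2.0          # 4 tokens with shifted, scaled features

layer_norm = nn.LayerNorm(512)
y = layer_norm(x)

# Manual version: statistics are taken over the feature axis of each token.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)   # biased variance, as in LayerNorm
y_manual = (x - mean) / t.sqrt(var + layer_norm.eps)

print(t.allclose(y, y_manual, atol=1e-5))   # the two versions agree
print(y.mean(dim=-1))                       # close to 0 for every token
print(y.std(dim=-1, unbiased=False))        # close to 1 for every token
```

Note that the statistics are computed per token, not per batch, which is why layer normalization behaves the same for a single sentence as for a large batch.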
The feed-forward module consists of two linear layers with a non-linear ReLU activation between them.
The architecture of the module allows the first layer to project its input into a higher dimension, while the second layer projects it back into the original space (from 512 to 1024 and back to 512 in this implementation).
def ffn(self, x: Tensor) -> Tensor:
    x1 = self.fc1(x)
    x2 = self.relu(x1)
    x3 = self.fc2(x2)
    return x3
This is the whole code of Encoder:
import math
import numpy as np
import torch as t
import torch.nn as nn
from torch import Tensor

class encoder(nn.Module):
    def __init__(
        self,
        num_heads: int,
        sent: Tensor) -> None:
        super(encoder, self).__init__()
        self.sent = sent  # (seq_len, 512) token embeddings
        self.num_heads = num_heads
        self.k_dim = 512; self.v_dim = 512; self.q_dim = 512
        self.W_q = nn.Linear(512, self.k_dim*self.num_heads)
        self.W_k = nn.Linear(512, self.k_dim*self.num_heads)
        self.W_v = nn.Linear(512, self.v_dim*self.num_heads)
        self.seq_len = sent.size()[0]
        assert self.seq_len == 4, "The sequence length should be 4."
        self.W_o = nn.Linear(512*self.num_heads,512)
        self.layer_norm = nn.LayerNorm(512)
        self.fc1 = nn.Linear(512, 1024)
        self.fc2 = nn.Linear(1024, 512)
        self.relu = nn.ReLU()

    def position_embedding(self, sent: Tensor, d_model: int) -> Tensor:
        pe = np.zeros((sent.size()[0], d_model))
        for pos in range(sent.size()[0]):
            for i in range(0, d_model, 2):
                # i is already the even dimension index, so the exponent is i/d_model.
                pe[pos][i] = math.sin(pos/(10000**(i/d_model)))
                pe[pos][i+1] = math.cos(pos/(10000**(i/d_model)))
        return sent + t.tensor(pe, dtype=sent.dtype)

    # we will make two heads for multi-head attention
    def self_attention(self):
        # Project to (seq_len, num_heads*dim), then split the heads: (1, num_heads, seq_len, dim).
        query = self.W_q(self.input_embedding).view(self.seq_len, self.num_heads, self.q_dim).transpose(0, 1).unsqueeze(0) # (1, 2, 4, 512)
        key = self.W_k(self.input_embedding).view(self.seq_len, self.num_heads, self.k_dim).transpose(0, 1).unsqueeze(0) # (1, 2, 4, 512)
        value = self.W_v(self.input_embedding).view(self.seq_len, self.num_heads, self.v_dim).transpose(0, 1).unsqueeze(0) # (1, 2, 4, 512)
        # we will take the scaled dot product of query and key to get the similarity score.
        attention_score = t.softmax(t.matmul(query, key.transpose(2,3))/math.sqrt(self.k_dim), dim=-1) # (1, 2, 4, 4)
        overall_attention = t.matmul(attention_score, value) # (1, 2, 4, 512)
        # concatenate the heads along the feature dimension.
        overall_attention = overall_attention.transpose(1, 2).reshape(1, self.seq_len, self.v_dim*self.num_heads) # (1, 4, 1024)
        final_attention = self.W_o(overall_attention) # (1, 4, 512)
        return final_attention

    def ffn(self, x: Tensor) -> Tensor:
        x1 = self.fc1(x)
        x2 = self.relu(x1)
        x3 = self.fc2(x2)
        return x3

    def forward(self):
        # The positional encoding must match the embedding width (512).
        self.input_embedding = self.position_embedding(self.sent, 512)
        # W_o is already applied inside self_attention, so it is not applied again here.
        multi_head_attn_out = self.self_attention() # (1, 4, 512)
        input_embedding = self.layer_norm(multi_head_attn_out + self.input_embedding)
        ffn_out = self.ffn(input_embedding)
        encoder_out = self.layer_norm(ffn_out + input_embedding) # (1, 4, 512)
        return encoder_out
The decoder architecture is often one of the least explained aspects in related materials, with limited information available about how the decoder interacts with the encoder. Therefore, this section aims to provide a detailed explanation to clarify the decoder's role and operation. Certain components that are identical to the encoder, such as the Feed-Forward Network and Multi-Head Attention, are not covered in depth here.
The decoder takes the output or ground-truth sentence as input and adds positional embeddings before passing it through the masked multi-head attention module.
In the masked multi-head attention module, the input is the sentence with added positional embeddings. The attention mechanism works similarly to the encoder, using "query," "key," and "value." However, the key difference is the inclusion of a masking tensor. This mask ensures that the model cannot access future token representations when predicting the next token, relying only on past information.
The masking tensor is constructed with values of $0$ on and below the diagonal and $-\infty$ in the upper right part of the matrix. It is added to the scaled product of the query and key, keeping the lower left values (including the diagonal) intact while setting the upper right values to $-\infty$. After applying softmax, these $-\infty$ elements are converted to $0$, effectively hiding future tokens for each word.
Next, the masked attention weights are multiplied by the values, t.matmul(B1,V1), so that each word only aggregates the values of the words with non-zero weights; this product is the output of the attention step. Finally, similar to multi-head attention, the outputs of all heads are concatenated and passed through a linear layer.
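The effect of the mask can be seen on a small example. The sketch below assumes a sequence of 4 tokens and uses random scores in place of the real query-key product; `t.triu` with `diagonal=1` keeps only the entries strictly above the diagonal and sets the rest to zero.

```python
import torch as t

t.manual_seed(0)
seq_len = 4
scores = t.randn(seq_len, seq_len)        # stand-in for the scaled query-key product

# -inf strictly above the diagonal, 0 on and below it.
mask = t.triu(t.full((seq_len, seq_len), float("-inf")), diagonal=1)

weights = t.softmax(scores + mask, dim=-1)
print(mask)
print(weights)    # upper-right entries are exactly 0; each row still sums to 1
```

The first row attends only to the first token, the second row to the first two tokens, and so on, which is exactly the "no access to future tokens" constraint.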
def masked_multi_head_attention(self) -> Tensor:
    '''
    We are making this as the masked multi-head attention, as we are masking the future words in the sentence.
    For reference you can look at its diagram before implementation to get an intuition about it.
    '''
    # Query, key, and value all come from the decoder input (self-attention).
    query = self.W_q_m(self.input_embedding).view(self.seq_len, self.num_heads, self.q_dim).transpose(0, 1).unsqueeze(0) # (1, 2, 4, 512)
    key = self.W_k_m(self.input_embedding).view(self.seq_len, self.num_heads, self.k_dim).transpose(0, 1).unsqueeze(0) # (1, 2, 4, 512)
    value = self.W_v_m(self.input_embedding).view(self.seq_len, self.num_heads, self.v_dim).transpose(0, 1).unsqueeze(0) # (1, 2, 4, 512)
    attention_score = t.matmul(query, key.transpose(2,3))/math.sqrt(self.k_dim) # (1, 2, 4, 4)
    # Adding the attention score with the masking tensor to mask the future words in the sentence.
    attention_score = t.softmax((attention_score + self.masking_tensor), dim = -1)
    overall_attention = t.matmul(attention_score, value) # (1, 2, 4, 512)
    # Concatenate the heads along the feature dimension.
    overall_attention = overall_attention.transpose(1, 2).reshape(1, self.seq_len, self.v_dim*self.num_heads) # (1, 4, 1024)
    final_attention = self.W_o_m(overall_attention) # (1, 4, 512)
    return final_attention
In this module, the multi-head attention functions similarly to how it does in the encoder. The key difference lies in the inputs it receives. The encoder_output is used to construct the keys and values, while the query is derived from the output of the masked multi-head attention module. This setup allows the model to incorporate information from the input sentence (through the keys and values) while utilizing the available context from the ground-truth to predict the next word.
def multi_head_attention(self, encoder_output: Tensor, dec_attn: Tensor) -> Tensor:
    '''
    We are making this function for just 1 sample.
    The words of which will be computed to have similarity with each other.
    The query, key, and value are the three projections of the embeddings into a new dimension.
    '''
    # The query comes from the decoder; the keys and values come from the encoder output.
    query = self.W_q(dec_attn).view(self.seq_len, self.num_heads, self.q_dim).transpose(0, 1).unsqueeze(0)
    key = self.W_k(encoder_output).view(-1, self.num_heads, self.k_dim).transpose(0, 1).unsqueeze(0)
    value = self.W_v(encoder_output).view(-1, self.num_heads, self.v_dim).transpose(0, 1).unsqueeze(0)
    attention_score = t.matmul(query, key.transpose(2,3))/math.sqrt(self.k_dim)
    # No masking tensor here: the decoder may look at every word of the input sentence.
    attention_score = t.softmax(attention_score, dim = -1)
    overall_attention = t.matmul(attention_score, value)
    # Concatenate the heads along the feature dimension.
    overall_attention = overall_attention.transpose(1, 2).reshape(1, self.seq_len, self.v_dim*self.num_heads)
    final_attention = self.W_o(overall_attention)
    return final_attention
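A small sketch of this cross-attention step, under the assumption that the source and target sentences may have different lengths (5 and 4 tokens here, chosen only for illustration) and using a single head to keep the shapes readable. The queries come from the decoder side, the keys and values from the encoder output, and no causal mask is applied because the whole input sentence is visible to the decoder. All tensors are random stand-ins.

```python
import math
import torch as t
import torch.nn as nn

t.manual_seed(0)
d_model, src_len, tgt_len = 512, 5, 4       # assumed toy sizes

encoder_output = t.randn(src_len, d_model)  # stand-in for the encoder result
dec_attn = t.randn(tgt_len, d_model)        # stand-in for the masked-attention output

W_q = nn.Linear(d_model, d_model)
W_k = nn.Linear(d_model, d_model)
W_v = nn.Linear(d_model, d_model)

query = W_q(dec_attn)            # (tgt_len, d_model)
key = W_k(encoder_output)        # (src_len, d_model)
value = W_v(encoder_output)      # (src_len, d_model)

# One row per target token, one column per source token.
weights = t.softmax(query @ key.T / math.sqrt(d_model), dim=-1)   # (tgt_len, src_len)
context = weights @ value                                         # (tgt_len, d_model)

print(weights.shape, context.shape)
```

The attention matrix is rectangular here: its rows follow the decoder sequence and its columns follow the encoder sequence, so the output always has one vector per target token.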
The whole code for decoder can be found below:
import math
import numpy as np
import torch as t
import torch.nn as nn
from torch import Tensor

class decoder(nn.Module):
    def __init__(self, num_heads: int, out_sent: Tensor, encoder_output: Tensor) -> None:
        super(decoder, self).__init__()
        self.out_sent = out_sent
        self.num_heads = num_heads
        '''
        We are making output dim same as the input dim,
        as we are taking 2 heads for multi-head attention,
        as a result, 1024/2 = 512 for the output dim.
        This will become 1024 when it will be concatenated.
        '''
        self.encoder_output = encoder_output
        self.k_dim = 512; self.v_dim = 512; self.q_dim = 512
        self.W_q = nn.Linear(512, self.k_dim*self.num_heads)
        self.W_k = nn.Linear(512, self.k_dim*self.num_heads)
        self.W_v = nn.Linear(512, self.v_dim*self.num_heads)
        self.W_q_m = nn.Linear(512, self.k_dim*self.num_heads)
        self.W_k_m = nn.Linear(512, self.k_dim*self.num_heads)
        self.W_v_m = nn.Linear(512, self.v_dim*self.num_heads)
        # seq_len must be set before the masking tensor that depends on it.
        self.seq_len = out_sent.size()[0]
        assert self.seq_len == 4, "The sequence length should be 4."
        # -inf above the diagonal hides future words; 0 elsewhere leaves the scores intact.
        self.masking_tensor = t.triu(t.full((1, self.num_heads, self.seq_len, self.seq_len), float("-inf")), diagonal = 1)
        self.W_o = nn.Linear(512*self.num_heads,512)
        self.W_o_m = nn.Linear(512*self.num_heads,512)
        self.layer_norm = nn.LayerNorm(512)
        self.fc1 = nn.Linear(512, 1024)
        self.fc2 = nn.Linear(1024, 512)
        self.relu = nn.ReLU()

    def position_embedding(self, sent: Tensor, d_model: int) -> Tensor:
        '''
        Defined in depth in the encoder.py file.
        '''
        pe = np.zeros((sent.size()[0], d_model))
        for pos in range(sent.size()[0]):
            for i in range(0, d_model, 2):
                # i is already the even dimension index, so the exponent is i/d_model.
                pe[pos][i] = math.sin(pos/(10000**(i/d_model)))
                pe[pos][i+1] = math.cos(pos/(10000**(i/d_model)))
        # adding positional encoding to the sentence, that will be passed into the transformer (encoder/decoder).
        return sent + t.tensor(pe, dtype=sent.dtype)

    def ffn(self, x: Tensor) -> Tensor:
        x1 = self.fc1(x)
        x2 = self.relu(x1)
        x3 = self.fc2(x2)
        return x3

    def masked_multi_head_attention(self) -> Tensor:
        '''
        We are making this as the masked multi-head attention, as we are masking the future words in the sentence.
        For reference you can look at its diagram before implementation to get an intuition about it.
        '''
        query = self.W_q_m(self.input_embedding).view(self.seq_len, self.num_heads, self.q_dim).transpose(0, 1).unsqueeze(0) # (1, 2, 4, 512)
        key = self.W_k_m(self.input_embedding).view(self.seq_len, self.num_heads, self.k_dim).transpose(0, 1).unsqueeze(0) # (1, 2, 4, 512)
        value = self.W_v_m(self.input_embedding).view(self.seq_len, self.num_heads, self.v_dim).transpose(0, 1).unsqueeze(0) # (1, 2, 4, 512)
        # we will take the scaled dot product of query and key to get the similarity score.
        attention_score = t.matmul(query, key.transpose(2,3))/math.sqrt(self.k_dim) # (1, 2, 4, 4)
        # Adding the attention score with the masking tensor to mask the future words in the sentence.
        attention_score = t.softmax((attention_score + self.masking_tensor), dim = -1)
        overall_attention = t.matmul(attention_score, value) # (1, 2, 4, 512)
        overall_attention = overall_attention.transpose(1, 2).reshape(1, self.seq_len, self.v_dim*self.num_heads) # (1, 4, 1024)
        final_attention = self.W_o_m(overall_attention) # (1, 4, 512)
        return final_attention

    def multi_head_attention(self, encoder_output: Tensor, dec_attn: Tensor) -> Tensor:
        '''
        We are making this function for just 1 sample.
        The words of which will be computed to have similarity with each other.
        The query, key, and value are the three projections of the embeddings into a new dimension.
        '''
        # The query comes from the decoder; the keys and values come from the encoder output.
        query = self.W_q(dec_attn).view(self.seq_len, self.num_heads, self.q_dim).transpose(0, 1).unsqueeze(0)
        key = self.W_k(encoder_output).view(-1, self.num_heads, self.k_dim).transpose(0, 1).unsqueeze(0)
        value = self.W_v(encoder_output).view(-1, self.num_heads, self.v_dim).transpose(0, 1).unsqueeze(0)
        attention_score = t.matmul(query, key.transpose(2,3))/math.sqrt(self.k_dim)
        # No masking tensor here: the decoder may look at every word of the input sentence.
        attention_score = t.softmax(attention_score, dim = -1)
        overall_attention = t.matmul(attention_score, value)
        overall_attention = overall_attention.transpose(1, 2).reshape(1, self.seq_len, self.v_dim*self.num_heads)
        final_attention = self.W_o(overall_attention)
        return final_attention

    def forward(self) -> Tensor:
        x = self.input_embedding = self.position_embedding(self.out_sent, 512)
        x_ = self.masked_multi_head_attention()
        x = self.layer_norm(x_ + x)
        x_ = self.multi_head_attention(self.encoder_output, x)
        x = self.layer_norm(x + x_)
        x_ = self.ffn(x)
        x = self.layer_norm(x_ + x)
        return x
I am grateful to Dr. Michael Sklar and Atif Hassan for helping me during the preparation of this repository. I am also grateful to my family, friends, and the online resources mentioned: