Apply a linear transformation followed by a non-linear activation and another linear transformation:
[
\mathbf{h}_1 = \text{ReLU}(\mathbf{z} W_1 + b_1)
]
[
\mathbf{h}_2 = \mathbf{h}_1 W_2 + b_2
]
Assume (W_1 \in \mathbb{R}^{d \times d_{\text{ff}}}) and (W_2 \in \mathbb{R}^{d_{\text{ff}} \times d}), where (d_{\text{ff}}) is the dimension of the feed-forward layer.
Repeat the self-attention and FFN steps for each Transformer block.
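To make the FFN step concrete, here is a minimal NumPy sketch. The sizes `d` and `d_ff` and the random weights are illustrative placeholders, not values from a real model, and `z` stands in for the self-attention output:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 8, 32   # toy sizes; GPT-2 small, for comparison, uses d = 768, d_ff = 3072

# z stands in for the self-attention output; weights are random placeholders
z = rng.standard_normal(d)
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)

h1 = np.maximum(0.0, z @ W1 + b1)   # ReLU(z W1 + b1), shape (d_ff,)
h2 = h1 @ W2 + b2                   # h1 W2 + b2, back to shape (d,)
print(h2.shape)                     # (8,)
```

Note that the FFN expands to the wider dimension (d_{\text{ff}}) and then projects back to (d), so its output can feed directly into the next Transformer block.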
4. Output Generation
Final Linear Transformation:
The output of the last Transformer block (\mathbf{h}_L) is transformed into logits for each token in the vocabulary:
[
\mathbf{logits} = \mathbf{h}_L W_{\text{out}} + b_{\text{out}}
]
Assume (W_{\text{out}} \in \mathbb{R}^{d \times V}) and (b_{\text{out}} \in \mathbb{R}^V), where (V) is the vocabulary size.
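A minimal sketch of this projection, again with toy sizes and random weights standing in for a trained model:

```python
import numpy as np

rng = np.random.default_rng(1)
d, V = 8, 50   # toy model dimension and vocabulary size

h_L = rng.standard_normal(d)          # output of the last Transformer block
W_out = rng.standard_normal((d, V))   # output projection
b_out = np.zeros(V)

logits = h_L @ W_out + b_out          # one unnormalized score per vocabulary token
print(logits.shape)                   # (50,)
```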
Softmax to Obtain Probabilities:
[
\mathbf{p} = \text{softmax}(\mathbf{logits})
]
The softmax function converts logits into a probability distribution over the vocabulary.
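In code, softmax is usually computed with the maximum logit subtracted first for numerical stability; a small self-contained sketch with toy logits:

```python
import numpy as np

def softmax(x):
    # Subtracting the max avoids overflow in exp(); the result is unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # toy logits for a three-token vocabulary
p = softmax(logits)
print(p)         # approx. [0.659 0.242 0.099]
print(p.sum())   # 1.0 -- a valid probability distribution
```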
5. Post-Processing
Select Next Token:
The next token is selected based on the highest probability in (\mathbf{p}):
[
\text{next\_token\_id} = \arg\max(\mathbf{p})
]
Assume the highest-probability entry is at token ID 23 (which might correspond to the word "world").
Convert Token ID to Text:
Token ID 23 is looked up in the tokenizer's vocabulary, yielding the word "world".
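A sketch of greedy selection and decoding; the tiny `vocab` mapping and the hand-set probabilities are purely hypothetical, standing in for a real tokenizer's ID-to-string table and a real model's output:

```python
import numpy as np

# Hypothetical ID-to-token table; a real tokenizer has tens of thousands of entries
vocab = {7: "Hello", 23: "world", 40: "!"}

p = np.zeros(50)
p[7], p[23], p[40] = 0.03, 0.92, 0.05   # pretend most probability mass landed on ID 23

next_token_id = int(np.argmax(p))   # greedy decoding: take the most probable token
print(next_token_id)                # 23
print(vocab[next_token_id])         # world
```

Greedy argmax is the simplest strategy; in practice, sampling with temperature, top-k, or top-p is often used instead to produce more varied text.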
Summary
To summarize, given the input "Hello":
Token ID 42 is converted to its embedding (\mathbf{e}_0).
The embedding is processed through multiple Transformer blocks using self-attention and feed-forward networks.
The final output is transformed to logits and passed through softmax to obtain a probability distribution.
The most probable next token ID is selected and converted back to text, resulting in "world".
The process can then repeat for generating subsequent tokens.
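Putting the pieces together, the generation loop looks schematically like this; `model_forward` is a hypothetical stand-in for the whole pipeline above (embedding, Transformer blocks, output projection, and softmax):

```python
import numpy as np

def generate(prompt_ids, model_forward, n_tokens):
    """Greedy autoregressive generation (schematic).

    model_forward is assumed to map a list of token IDs to a probability
    distribution over the vocabulary for the next token.
    """
    ids = list(prompt_ids)
    for _ in range(n_tokens):
        p = model_forward(ids)        # embeddings -> Transformer blocks -> softmax
        next_id = int(np.argmax(p))   # greedy choice; sampling is a common alternative
        ids.append(next_id)           # feed the new token back as input
    return ids
```

Each generated token is appended to the input and the forward pass runs again, which is what makes the model autoregressive.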