In the current GPT implementation, tokens are embedded by one-hot encoding them and passing them through a linear layer. Because the inputs are one-hot vectors, the full matrix multiplication is unnecessary: multiplying a one-hot row by the weight matrix simply selects the corresponding row, so we can replace the multiplication with a direct row lookup.
We can encapsulate this in an Embedding layer, which should significantly reduce both memory usage and computational load.
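To make the equivalence concrete, here is a minimal NumPy sketch (the shapes and variable names are illustrative assumptions, not taken from the actual implementation). It shows that the one-hot matmul and the row lookup produce identical embeddings:

```python
import numpy as np

vocab_size, d_model = 50257, 768  # assumed GPT-2-style dimensions
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, d_model)).astype(np.float32)  # embedding weights

token_ids = np.array([15496, 995])  # arbitrary example token ids

# Current approach: one-hot encode, then multiply through the linear layer.
one_hot = np.zeros((len(token_ids), vocab_size), dtype=np.float32)
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
embedded_matmul = one_hot @ W  # (2, vocab_size) @ (vocab_size, d_model)

# Embedding-layer approach: index directly into the same weight matrix.
embedded_lookup = W[token_ids]  # no one-hot tensor, no matmul

assert np.allclose(embedded_matmul, embedded_lookup)
```

The lookup avoids materializing a `(batch, vocab_size)` one-hot tensor entirely, which is where both the memory and compute savings come from.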