bclarkson-code / Tricycle

Autograd to GPT-2 completely from scratch

Add embedding layer #36

Closed bclarkson-code closed 5 months ago

bclarkson-code commented 5 months ago

In the current GPT implementation, tokens are embedded by one-hot encoding them and passing the result through a linear layer. Because the inputs are one-hot, the full matrix multiplication is unnecessary: each token simply selects a single row of the weight matrix, so the multiplication can be replaced with a dictionary-style lookup.

We can encapsulate this in an Embedding layer, which should significantly reduce both memory usage and computational load.
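
For reference, a minimal NumPy sketch of the idea (this is not Tricycle's actual Tensor API; the class name, shapes, and backward helper are illustrative assumptions):

```python
import numpy as np

class Embedding:
    """Map integer token ids to dense vectors by indexing rows of a
    weight matrix, instead of one-hot encoding + matrix multiply."""

    def __init__(self, vocab_size: int, embed_dim: int):
        # same initialisation you'd use for the equivalent linear layer
        self.weights = np.random.randn(vocab_size, embed_dim) * 0.02

    def forward(self, token_ids: np.ndarray) -> np.ndarray:
        # token_ids: integer array of shape (batch, seq_len)
        # fancy indexing picks one row per token -> (batch, seq_len, embed_dim)
        return self.weights[token_ids]

    def backward(self, token_ids: np.ndarray, grad_out: np.ndarray) -> np.ndarray:
        # gradient w.r.t. the weights: scatter-add each output gradient
        # back onto the row of the token that produced it
        grad_weights = np.zeros_like(self.weights)
        np.add.at(
            grad_weights,
            token_ids.reshape(-1),
            grad_out.reshape(-1, grad_out.shape[-1]),
        )
        return grad_weights
```

The forward pass is equivalent to `one_hot(token_ids) @ weights`, but it never materialises the one-hot matrix, so both the memory footprint and the compute drop from O(vocab_size) per token to O(embed_dim).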