Closed davmacario closed 4 months ago
As explained here, A. Karpathy's repository uses Linear layers in the model, while the GPT-2 implementation from Hugging Face uses Conv1D layers.
The result is the same, but matching the Conv1D layout requires fewer operations when loading from pretrained weights, plus it's compliant with the Transformers library.
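To illustrate the difference: Hugging Face's Conv1D stores its weight as (in_features, out_features), the transpose of nn.Linear's (out_features, in_features), so the two compute the same affine map but porting weights between them needs a transpose. A minimal sketch (the Conv1D class below is a simplified stand-in for `transformers.pytorch_utils.Conv1D`, written here only to keep the example self-contained):

```python
import torch
import torch.nn as nn

class Conv1D(nn.Module):
    """Simplified stand-in for HF's Conv1D: weight shaped (nx, nf)."""
    def __init__(self, nf, nx):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(nx, nf).normal_(std=0.02))
        self.bias = nn.Parameter(torch.zeros(nf))

    def forward(self, x):
        # x @ W + b, with W stored as (in_features, out_features)
        return x @ self.weight + self.bias

in_f, out_f = 4, 8
conv = Conv1D(out_f, in_f)
linear = nn.Linear(in_f, out_f)

# Copying Conv1D weights into a Linear layer requires a transpose;
# this is the extra step a Linear-based model pays when loading
# pretrained GPT-2 checkpoints.
with torch.no_grad():
    linear.weight.copy_(conv.weight.t())
    linear.bias.copy_(conv.bias)

x = torch.randn(2, in_f)
assert torch.allclose(linear(x), conv(x), atol=1e-6)
```

With the transposed copy in place, both layers produce identical outputs, which is why the choice only affects checkpoint loading and library compatibility, not the model's math.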
Not pertinent anymore - switched to LitGPT