While experimenting in #32, I gathered that it might be good to support the official GPT-2 baseline, so here it is.
Some notes:
gpt2 does not 100% reproduce the official Hugging Face implementation, most probably because of slight numerical differences between nn.Linear (ours) and Conv1D (Hugging Face's)
addition of a "TRANSPOSE" mechanism in convert_HF (again, Linear vs Conv1D)
the hellaswag evaluation tool is still a bit janky
addition of "Learned" position_encoding, some additional factorization around this might be good
⚠️ modification of the default left_padding behaviour (might still be improved)
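For context on the "TRANSPOSE" point above: Hugging Face's Conv1D stores its weight as (in_features, out_features), whereas nn.Linear expects (out_features, in_features), so the checkpoint converter has to transpose those matrices. A minimal dependency-free sketch of the idea (function name hypothetical, not the actual convert_HF code):

```python
# Conv1D (Hugging Face) weight layout: (in_features, out_features)
# nn.Linear (ours) weight layout:      (out_features, in_features)
# Converting a GPT-2 checkpoint therefore requires transposing these weights.

def transpose_weight(weight):
    """Swap the two axes of a weight matrix stored as nested lists.

    Hypothetical helper illustrating the "TRANSPOSE" step in convert_HF.
    """
    return [list(row) for row in zip(*weight)]

# Conv1D-style weight: 2 input features -> 3 output features
conv1d_w = [[1, 2, 3],
            [4, 5, 6]]              # shape (in=2, out=3)

linear_w = transpose_weight(conv1d_w)   # shape (out=3, in=2)
assert linear_w == [[1, 4], [2, 5], [3, 6]]
```

Note this only addresses the layout mismatch; the small numerical differences mentioned above can remain even after a correct transpose, since the two layers implement the matmul differently.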
@funboarder13920 @l-k-11235 this will conflict with #26 and potential future work there