danbraunai / simple_stories_train

Trains small LMs. Designed for training on SimpleStories

Check that our implementation matches a real llama implementation #2

Open danbraunai opened 2 months ago

danbraunai commented 2 months ago

We need to validate that our llama implementation doesn't have bugs in it. A reasonable way to do this is to compare its outputs on a fixed set of inputs against those of the HF llama implementation.
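A minimal sketch of such a comparison, in plain Python: run the same token ids through both models, dump the logits (e.g. via `.tolist()`), and look at the maximum elementwise absolute difference. The helper below is generic; the model/tokenizer loading calls are left out since this repo's actual API isn't shown here.

```python
def max_abs_diff(a, b):
    """Max elementwise absolute difference between two equal-shape nested lists.

    Intended use: max_abs_diff(ours_logits.tolist(), hf_logits.tolist()),
    where both logits tensors come from the same input_ids.
    """
    if isinstance(a, (int, float)):
        return abs(a - b)
    return max(max_abs_diff(x, y) for x, y in zip(a, b))

# Example on dummy "logits" of shape (batch=1, seq=2, vocab=3):
ours = [[[0.1, 0.2, 0.3], [0.0, 1.0, 2.0]]]
theirs = [[[0.1, 0.25, 0.3], [0.0, 1.0, 1.9]]]
diff = max_abs_diff(ours, theirs)  # ~0.1
```

Since an exact match isn't expected (see below), the useful check is whether this difference stays below a loose tolerance, and whether the argmax tokens agree.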

Note that we shouldn't expect an exact match; our rotary embedding implementation differs from the one used in the HF implementation. Hopefully the outputs are still reasonably close. We may need to adjust the rotary hyperparameters.
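A cheap first check on the rotary side is to compare inverse-frequency tables rather than full model outputs: the standard RoPE construction (which HF llama follows, with a default base of 10000.0) uses `inv_freq[i] = base ** (-2i / dim)`, so a mismatched `base` or rotary dim would show up here immediately. A sketch:

```python
def rotary_inv_freq(dim: int, base: float = 10000.0) -> list[float]:
    """Inverse frequencies for rotary embeddings: inv_freq[i] = base**(-2i/dim).

    Mirrors the standard RoPE construction; compare this table against the one
    our implementation builds to rule out a hyperparameter mismatch before
    diffing full model outputs.
    """
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

# For any dim, the first frequency is 1.0 and the rest decay geometrically:
freqs = rotary_inv_freq(8)
```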

If the outputs don't match, it may be worth implementing the more complex rotary setup they have in HF (or simply training with the HF model directly).

Note that the PR for this will involve getting the from_pretrained method to work for llama (rather than the copied-over gpt2 code that is currently in there).
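Getting from_pretrained working for llama is largely a state-dict renaming exercise. A sketch of that mapping, where the HF-side keys are the real `LlamaForCausalLM` parameter names but the our-side names (`wte`, `blocks.{i}.attn`, etc.) are hypothetical placeholders, not this repo's actual attribute names:

```python
def remap_hf_llama_state_dict(hf_sd: dict, n_layers: int) -> dict:
    """Rename HF LlamaForCausalLM parameters into a (hypothetical) local scheme."""
    mapping = {
        "model.embed_tokens.weight": "wte.weight",
        "model.norm.weight": "ln_f.weight",
        "lm_head.weight": "lm_head.weight",
    }
    per_layer = {
        "self_attn.q_proj.weight": "attn.q_proj.weight",
        "self_attn.k_proj.weight": "attn.k_proj.weight",
        "self_attn.v_proj.weight": "attn.v_proj.weight",
        "self_attn.o_proj.weight": "attn.o_proj.weight",
        "mlp.gate_proj.weight": "mlp.gate_proj.weight",
        "mlp.up_proj.weight": "mlp.up_proj.weight",
        "mlp.down_proj.weight": "mlp.down_proj.weight",
        "input_layernorm.weight": "ln1.weight",
        "post_attention_layernorm.weight": "ln2.weight",
    }
    for i in range(n_layers):
        for hf_key, our_key in per_layer.items():
            mapping[f"model.layers.{i}.{hf_key}"] = f"blocks.{i}.{our_key}"
    # Fail loudly on any key we forgot to map, rather than silently dropping it.
    missing = [k for k in hf_sd if k not in mapping]
    if missing:
        raise KeyError(f"unmapped HF keys: {missing}")
    return {mapping[k]: v for k, v in hf_sd.items()}

# Smoke test with placeholder values standing in for real weight tensors:
dummy = {"model.embed_tokens.weight": 0, "model.layers.0.self_attn.q_proj.weight": 1}
remapped = remap_hf_llama_state_dict(dummy, n_layers=1)
```

Loading would then be our_model.load_state_dict(remapped) (with strict=True, so leftover or missing keys surface immediately).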