ApolloResearch / rib

Library for methods related to the Local Interaction Basis (LIB)
MIT License

Tiny stories support #232

Closed: nix-apollo closed this 9 months ago

nix-apollo commented 9 months ago

tiny stories

Description

Related Issue

Motivation and Context

It's nice to be able to test our methods on a range of models. Tiny-stories is probably a better choice than pythia-14M for many experiments, as:

How Has This Been Tested?

Added tinystories to various tests, often piggybacking on gpt2. These check that the sequential transformer is loaded properly and that its output is consistent with the TransformerLens (TL) version.
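
For reference, the equivalence check is roughly of this shape (a minimal sketch, not the actual test code; the rib-side loader is a hypothetical stand-in, and the exact TransformerLens model name may differ):

```python
import torch
from transformer_lens import HookedTransformer

# Reference TransformerLens model (name assumed; TL also accepts aliases such
# as "roneneldan/TinyStories-1M").
tl_model = HookedTransformer.from_pretrained("tiny-stories-1M")

# Hypothetical stand-in for rib's sequential transformer loader.
# seq_model = load_sequential_transformer("tiny-stories-1M", ...)

prompt = "Once upon a time there was a little girl named Lily."
tokens = tl_model.to_tokens(prompt)

tl_logits = tl_model(tokens)          # [batch, seq, d_vocab]
# seq_logits = seq_model(tokens)      # shape depends on rib's forward signature

# The tests assert the two implementations agree up to floating-point noise.
# torch.testing.assert_close(seq_logits, tl_logits, atol=1e-4, rtol=1e-4)
```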

I've also manually checked that the loss is comparable with the published loss in the paper. We get a loss of 2.40 vs 2.38 in the paper, so it's not identical. I'd guess this comes from different tokenisation; it's not clear that they used packed sequences. Still, the performance is good enough for me to think the model is basically doing its job.
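
A rough sketch of that manual loss check (assuming the HuggingFace `roneneldan/TinyStories` dataset and TransformerLens' built-in loss; since the published number likely used different tokenisation/packing, small discrepancies are expected):

```python
import torch
from datasets import load_dataset
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("tiny-stories-1M")  # model name assumed
data = load_dataset("roneneldan/TinyStories", split="validation")

losses = []
with torch.no_grad():
    for story in data["text"][:200]:  # a small sample is enough for a sanity check
        tokens = model.to_tokens(story)
        if tokens.shape[1] < 2:
            continue
        # return_type="loss" gives mean next-token cross-entropy for the sequence.
        losses.append(model(tokens, return_type="loss").item())

# Compare against the ~2.4 figure above; packing/tokenisation details may shift this slightly.
print(f"mean loss: {sum(losses) / len(losses):.2f}")
```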

Does this PR introduce a breaking change?

No.