EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics

Replicating the Training Data Order #136

Closed prakharg24 closed 8 months ago

prakharg24 commented 8 months ago

Hi,

I've been trying to replicate the training data order for the Pythia models, and this issue is an accumulation of what I've learned and the problems I've run into along the way. I would really appreciate it if any of the original authors, who are more familiar with the code, could verify my understanding and help me answer the questions I'm stuck on.

Since one big appeal of this project for me was the ability to study how LLMs evolve over training, I think it's crucial that replicating the training data order doesn't require too many bells and whistles. I can see that the authors are actively working on improving this repo, so this is also my attempt to help by writing down exactly what a reader takes away from the current instructions, along with the major issues in the code right now when trying to replicate the training data order. Now, on to the main issue.


Attempt 1

My first attempt at recreating the training data order was from the following section of the repo - https://github.com/EleutherAI/pythia#reproducing-training

This section suggests downloading the tokenized version of the Pile and then unsharding it, to 'guarantee preservation of the data order seen by the Pythia models'. Am I correct in interpreting this as saying that the downloaded dataset is in the exact order seen by the Pythia models during training? In other words, no further data shuffling is required? I have my doubts, because this section goes on to show how to reproduce training, which might involve further shuffling. But for now I'm going to assume that the downloaded dataset is indeed in the exact training data order seen by the models.
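For what it's worth, here is the kind of quick sanity check I'd want to run at this point. This is only a minimal sketch, not anything from the repo's instructions: it assumes the merged .bin/.idx pair lives at a placeholder prefix like path/to/merged/document, and that MMapIndexedDataset from utils/mmap_dataset.py accepts that prefix plus skip_warmup and supports len() and integer indexing.

```python
# Hedged sketch: inspect the unsharded dataset to see whether entries are
# variable-length documents or already-packed 2049-token contexts.
# "path/to/merged/document" is a placeholder for wherever the merged
# .bin/.idx prefix actually lives on disk.
from utils.mmap_dataset import MMapIndexedDataset

dataset = MMapIndexedDataset("path/to/merged/document", skip_warmup=True)

print(len(dataset))                          # total number of entries in the index
print([len(dataset[i]) for i in range(5)])   # all equal to 2049 => packed contexts;
                                             # varying lengths => raw documents that
                                             # still need packing with EOD separators
```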

Assuming we now have the exact order of documents seen by these models during training, the next step is to load them as batches. This is where the file utils/mmap_dataset.py comes into play. I believe there are some bugs in this dataloader. More specifically, following my understanding of this comment (https://github.com/EleutherAI/pythia/issues/123#issuecomment-1791292877), the dataloader:

  1. Does not take into account that documents are joined together with 'EOD' tokens in between; it simply concatenates them one right after the other.
  2. Does not take into account that the data between a given start iteration and end iteration is not necessarily divisible by 2049, i.e., it will not split cleanly into a whole number of contexts.

I have attached two files here: a modified version of mmap_dataset.py that adds 'EOD' tokens and does not 'reshape' the output, and another file called pythia_utils.py that contains a wrapper on top of mmap_dataset.py to load any 'subset' of data from a start iteration to an end iteration (see the sketch after the link below for the gist of the packing logic). Hopefully this is helpful to those who might be struggling with the current mmap_dataset.py, and perhaps the original authors can verify whether my changes are correct and update their code accordingly.

Link to the two files I mentioned -- https://drive.google.com/drive/folders/16C7AXtyJ6ASiM4I8rKKXy8LLToLzADh3?usp=sharing
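To make the intended fix concrete, here is a minimal sketch of the packing behaviour I described above. It is illustrative only: the `documents` argument stands in for iterating over the mmap dataset, and I'm assuming the EOD token id is 0 (as it is for the GPT-NeoX-20B tokenizer used by Pythia); please correct me if that assumption is wrong.

```python
import numpy as np

SEQ_LEN = 2049       # 2048 training tokens + 1 for the shifted target
EOD_TOKEN_ID = 0     # assumption: <|endoftext|> id for the tokenizer used by Pythia

def pack_documents(documents):
    """Concatenate token-id arrays with an EOD separator between documents and
    yield full SEQ_LEN-token contexts, carrying the remainder over to the next
    context instead of reshaping a buffer that may not divide evenly."""
    buffer = []
    for doc in documents:                    # `documents` = iterable of 1-D token-id arrays
        buffer.extend(int(t) for t in doc)
        buffer.append(EOD_TOKEN_ID)          # EOD token between consecutive documents
        while len(buffer) >= SEQ_LEN:
            yield np.asarray(buffer[:SEQ_LEN], dtype=np.uint16)
            buffer = buffer[SEQ_LEN:]        # keep leftover tokens for the next context
    # tokens left over at the very end (fewer than SEQ_LEN) are dropped in this sketch
```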

This ends my first attempt at replicating the training data order. It was made before the new set of instructions was added to the readme in the last 1-2 weeks. Note, however, that this attempt is limited to the deduplicated version of the Pile, because the tokenized dataset is only provided for the deduplicated version.


Attempt 2

I was excited to see a new set of instructions added to explore the dataset - https://github.com/EleutherAI/pythia/tree/main#exploring-the-dataset

However, looking into the details, it seems like this only downloads the Pile dataset BEFORE shuffling, and is therefore not actually useful when it comes to the training data order for Pythia. Am I correct in my interpretation?

And if that is the case, the first sentence of this section, 'We provide a tool to view particular portions of the training dataloader used by all models during training', is not actually accurate: this data isn't in the order seen by Pythia during training, just some other ordering of the Pile dataset.

This ends my second attempt at replicating the training data order, which unfortunately wasn't as successful as the first one.


Final Thoughts

I'm not aware of any potential behind-the-scenes issues, so purely from the viewpoint of a user, I really like the dataloader setup from my first attempt (provided, of course, that the authors can verify my understanding is correct). If the authors could also release the tokenized version of the complete Pile dataset (with duplicates) already shuffled, just like the deduplicated version is (again, maybe I'm wrong about this?), it would be very easy to quickly start exploring the training data order for the various Pythia models.

As of now, I don't understand which set of instructions will give me, for instance, the exact data seen by the model at, say, training step 100000.
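To make the question concrete, this is the kind of snippet I would hope works, under two assumptions I'd love to have confirmed: that the unsharded dataset really is stored as 2049-token contexts in the exact training order, and that every training step consumes a global batch of 1024 contexts.

```python
from utils.mmap_dataset import MMapIndexedDataset

BATCH_SIZE = 1024    # assumed global batch size for the Pythia runs
STEP = 100_000       # the training step I want to inspect

# "path/to/merged/document" is a placeholder for the unsharded .bin/.idx prefix.
dataset = MMapIndexedDataset("path/to/merged/document", skip_warmup=True)

start, end = STEP * BATCH_SIZE, (STEP + 1) * BATCH_SIZE
batch = [dataset[i] for i in range(start, end)]   # 1024 contexts of 2049 token ids each,
                                                  # if the assumptions above hold
```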

prakharg24 commented 8 months ago

Apologies. It seems I misunderstood the term 'preshuffled'. It doesn't mean 'before shuffling'; it means the data has already been shuffled.

The latest set of instructions in the readme for replicating the data order seems to work well.