Closed — prakharg24 closed this issue 9 months ago
Hi, thanks for your interest!
As a sanity check of the new batch_viewer.py, you can also use the older version here: https://github.com/EleutherAI/pythia/commit/899add0f1c71cb27dbf5a7594202584416c0b424 @uSaiPrashanth will be bringing the documentation in the README in line with this updated version.
Yes, this is correct! We tokenize all documents, shuffle them, and concatenate them, separating documents with a single EOD token. Thus a given sample or batch may not start or end with an EOD token; sample boundaries do not respect document boundaries, and we do not prevent cross-attention between different documents within a context window. This is standard practice in many public LLM training codebases. For the ground truth on this, v1.0 of the NeoX repository is a good reference: https://github.com/EleutherAI/gpt-neox/tree/v1.0
The reason for a sequence length of 2049 is that the target tokens are the input tokens shifted left by one position (so tokens `[0 1 2 3]` are seen and used to predict token `4` as the target, and so on). Of each 2049-token sample, the first 2048 tokens (all but the last) are used as inputs to the model, and the last 2048 tokens (all but the first) are used as targets for calculating loss. Thus we calculate loss on 2048 tokens per sample.
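The shift described above can be sketched in a few lines of numpy. This is an illustration only; `sample` stands in for one 2049-token training sample (real samples hold token ids, not a range):

```python
import numpy as np

# One hypothetical 2049-token sample, as yielded by the preshuffled dataset.
sample = np.arange(2049)

inputs = sample[:-1]   # first 2048 tokens: fed to the model
targets = sample[1:]   # last 2048 tokens: labels, shifted left by one

# targets[i] is the token the model must predict after seeing inputs[: i + 1],
# so loss is computed on exactly 2048 positions per sample.
assert len(inputs) == 2048 and len(targets) == 2048
```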
Hi @haileyschoelkopf Thank you for the response!
One more thing I'd like to clarify. Am I correct to assume that the tokenized data downloaded according to the instructions here is already shuffled - https://github.com/EleutherAI/pythia#reproducing-training ?
Simply put, to reproduce the exact batches used during training, I need to
Thank you!!
I have the same questions as @prakharg24. Specifically:
There are two versions of the deduped, pre-shuffled datasets mentioned in the README. As I understand:

- `EleutherAI/pythia_deduped_pile_idxmaps` contains tokenized documents without any EOD tokens. I've looked at the data and this seems to be the case.
- `EleutherAI/pile-deduped-pythia-preshuffled` contains tokenized documents with EOD tokens.

Is this correct?
If the data is divided into 2049-sized sequences naively, then the first token of each sequence will not be seen (as a label) by the model. Is this intended?
Can someone please help with this? @haileyschoelkopf
I have similar doubts regarding the nature of the data as @itsnamgyu and @prakharg24.
If I want to only use a subset (say arXiv only) to train a pythia model, how do I download only those pretokenized data (including EOD tokens)?
Any input is appreciated. cc @haileyschoelkopf @crowsonkb @joshlk
Hi @itsnamgyu @prakharg24 hopefully I can answer some of your dataset questions here:
If the data is divided into 2049-sized sequences naively, then the first token of each sequence will not be seen (as a label) by the model. Is this intended?
This is correct. Because we do not train using any BOS tokens, there is no way to see the first token of a sequence as a label by the model. This is because one cannot feed in the empty string to a model (unless it was trained with a BOS token that can act as such. You could attempt to simulate this by passing EOD into the Pythia models, but I am unsure of the behavior that would result.)
If I want to only use a subset (say arXiv only) to train a pythia model, how do I download only those pretokenized data (including EOD tokens)?
@sujantkumarkv unfortunately, when tokenizing the Pile dataset, metadata about subsets is not retained. We don't currently have an easy way to train only on, say, the arXiv subset, and would recommend retokenizing that subset separately using GPT-NeoX's `prepare_data.py`.
Regarding how to replicate training order:
If using `EleutherAI/pile-deduped-pythia-preshuffled`: once you've downloaded and combined the shards, loading them with MMapIndexedDataset via the script at https://github.com/EleutherAI/pythia/blob/dc24af59cff8c8159a1d4b106393b39c39a1ef2e/utils/batch_viewer.py will provide a dataset where `dataset[i]` is a length-2049 sequence of tokens, containing EOD tokens separating the end of one document from the beginning of the next.
To access the `j`-th batch item at step `k` of training, you can access `dataset[(k * 1024) + j]`, which should give the context window seen at that item in training.
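The index arithmetic above can be written out as a tiny helper (the name `sample_index` is mine, not the repository's; 1024 is the Pythia batch size stated above):

```python
BATCH_SIZE = 1024  # sequences per training step for Pythia

def sample_index(step, item):
    """Flat index into the preshuffled dataset for the `item`-th
    sequence of the batch seen at training step `step`."""
    return step * BATCH_SIZE + item
```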
`EleutherAI/pythia_deduped_pile_idxmaps` contains binidx files that can be used with the script at https://github.com/EleutherAI/pythia/blob/899add0f1c71cb27dbf5a7594202584416c0b424/utils/batch_viewer.py. These binidx files contain the tokenized documents prior to chopping them into the context windows seen during training. Here, the binidx files must be loaded using `megatron.data.gpt2_dataset.GPT2Dataset` with the appropriate arguments, in order to perform shuffling via Megatron's dataset code (as was done during training) and chop the documents appropriately into context windows.
We've updated the README to clarify how to use the preshuffled binidx files!
If you're looking to reproduce the Pythia training order:

1. For viewing the training data contents, we recommend using the preshuffled binidx files together with the most up-to-date README and `batch_viewer.py`. This lets you dump the context windows seen by Pythia directly to disk and is significantly faster than the old `batch_viewer.py`.
2. If you want to re-train Pythia, we recommend doing so with the GPT-NeoX library at v1.0, taking care to use the exact same config file as we provide for the Pythia models.
I hope that this is helpful!
Thanks so much for the detailed answer!
Just to clarify for other readers, I've confirmed that `EleutherAI/pythia_deduped_pile_idxmaps` does not have any EOD tokens (but please comment if I'm wrong).
Further to @itsnamgyu's comment, I can confirm that `pile-deduped-pythia-preshuffled` does not have any EOD tokens either (I checked a 100k sample; let me know if I missed anything).
@pietrolesci Actually, according to the comment above, `pile-deduped-pythia-preshuffled` should have EOD tokens while `EleutherAI/pythia_deduped_pile_idxmaps` does not, so that is contradictory. Are you sure you are referring to `pile-deduped-pythia-preshuffled`?
If using `EleutherAI/pile-deduped-pythia-preshuffled`: once you've downloaded and combined the shards, loading them with MMapIndexedDataset via the script at https://github.com/EleutherAI/pythia/blob/dc24af59cff8c8159a1d4b106393b39c39a1ef2e/utils/batch_viewer.py will provide a dataset where `dataset[i]` is a length-2049 sequence of tokens, containing EOD tokens separating the end of one document from the beginning of the next.
Note, `batch_viewer.py` does not have any code to add EOD tokens.
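For readers who want to verify the presence or absence of EOD tokens themselves, a minimal check might look like this. The assumption that `<|endoftext|>` has id 0 comes from the GPT-NeoX-20B tokenizer used by Pythia; adjust `EOD_TOKEN_ID` if your tokenizer differs:

```python
import numpy as np

# Assumption: <|endoftext|> is token id 0 in the GPT-NeoX-20B tokenizer.
EOD_TOKEN_ID = 0

def contains_eod(sample):
    """True if the token sequence contains at least one EOD token."""
    return bool(np.any(np.asarray(sample) == EOD_TOKEN_ID))
```

Running this over a large slice of samples (as commenters in this thread did) gives a quick empirical answer.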
@haileyschoelkopf sorry for the trouble, but is `EleutherAI/pythia_deduped_pile_idxmaps` also pre-shuffled?
You mentioned in your comment:
Here, these binidx files must be loaded using megatron.data.gpt2_dataset.GPT2Dataset with the appropriate arguments, in order to perform shuffling via megatron's dataset code (as was done during training) and chop the documents appropriately into context windows.
whereas https://github.com/EleutherAI/pythia#reproducing-training says about `EleutherAI/pythia_deduped_pile_idxmaps`:
We recommend downloading this rather than retokenizing the Pile from scratch in order to guarantee preservation of the data order seen by the Pythia models
I'm training on `EleutherAI/pythia_deduped_pile_idxmaps` (while manually injecting EOD tokens), and both (1) some manual inspection and (2) the training loss suggest that it is in fact pre-shuffled.
Related to #127
@pietrolesci Actually, according to the comment above, `pile-deduped-pythia-preshuffled` should have EOD tokens while `EleutherAI/pythia_deduped_pile_idxmaps` does not, so that is contradictory. Are you sure you are referring to `pile-deduped-pythia-preshuffled`?
Hi @itsnamgyu, I confirm that -- contrary to what is expected and described in the README -- `pile-deduped-pythia-preshuffled` does NOT have an EOD token.
I also ran into the absence of EOD tokens just now 👀
will ping @haileyschoelkopf
I have checked the preshuffled dataset and found that the actual seq_length is 2050, not 2049 as described.
Hi, I am using `utils/batch_viewer.py` to iterate through Pythia's training data and calculate some batch-level statistics. Firstly, there are some gaps between the actual code in `batch_viewer.py` and what the README describes (for example, it doesn't take any config file as input, the load file name needs to be supplied separately, etc.). But these differences were obvious enough that I could fix them on my end and run the code.
However, it's the final step of saving the data after loading the buffer that I'm a bit confused about. I have two questions: