Closed — prakharg24 closed this issue 9 months ago
Hi, thanks for your interest!
As a sanity check of the new batch_viewer.py, you can also use the older version here: https://github.com/EleutherAI/pythia/commit/899add0f1c71cb27dbf5a7594202584416c0b424 @uSaiPrashanth will be bringing the documentation in the README in line with this updated version.
Yes, this is correct! We tokenize all documents, shuffle them, and concatenate them, separating documents with a single EOD token. Thus a given sample or batch may not start or end with an EOD token; sample boundaries do not respect document boundaries, and we do not prevent cross-attention between different documents within a context window. This is standard practice in many public LLM training codebases. For the ground truth on this, v1.0 of the NeoX repository is a good reference: https://github.com/EleutherAI/gpt-neox/tree/v1.0
The reason for a sequence length of 2049 is that the target tokens are the input tokens shifted left by one position (so tokens `[0 1 2 3]` are seen and used to predict token `4` as the target, and so on). Of each 2049-token sample, the first 2048 tokens (all but the last) are used as inputs to the model, and the last 2048 tokens (all but the first) are used as targets for calculating loss. Thus we calculate loss on 2048 tokens per sample.
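The shift described above can be sketched in a few lines of numpy. This is an illustration only; `sample` stands in for one 2049-token training sample (real samples hold token ids, not a range):

```python
import numpy as np

# One hypothetical 2049-token sample, as yielded by the preshuffled dataset.
sample = np.arange(2049)

inputs = sample[:-1]   # first 2048 tokens: fed to the model
targets = sample[1:]   # last 2048 tokens: labels, shifted left by one

# targets[i] is the token the model must predict after seeing inputs[: i + 1],
# so loss is computed on exactly 2048 positions per sample.
assert len(inputs) == 2048 and len(targets) == 2048
```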
Hi @haileyschoelkopf Thank you for the response!
One more thing I'd like to clarify. Am I correct to assume that the tokenized data downloaded according to the instructions here is already shuffled - https://github.com/EleutherAI/pythia#reproducing-training ?
Simply put, to reproduce the exact batches used during training, I need to
Thank you!!
I have the same questions as @prakharg24. Specifically:
There are two versions of the deduped, pre-shuffled datasets mentioned in the README. As I understand:

- `EleutherAI/pythia_deduped_pile_idxmaps` contains tokenized documents without any EOD tokens. I've looked at the data and this seems to be the case.
- `EleutherAI/pile-deduped-pythia-preshuffled` contains tokenized documents with EOD tokens.

Is this correct?
If the data is divided into 2049-sized sequences naively, then the first token of each sequence will not be seen (as a label) by the model. Is this intended?
Can someone please help with this? @haileyschoelkopf
I have similar doubts regarding the nature of the data as @itsnamgyu and @prakharg24.
If I want to only use a subset (say arXiv only) to train a pythia model, how do I download only those pretokenized data (including EOD tokens)?
Any input is appreciated. cc @haileyschoelkopf @crowsonkb @joshlk
Hi @itsnamgyu @prakharg24 hopefully I can answer some of your dataset questions here:
If the data is divided into 2049-sized sequences naively, then the first token of each sequence will not be seen (as a label) by the model. Is this intended?
This is correct. Because we do not train using any BOS tokens, there is no way to see the first token of a sequence as a label by the model. This is because one cannot feed in the empty string to a model (unless it was trained with a BOS token that can act as such. You could attempt to simulate this by passing EOD into the Pythia models, but I am unsure of the behavior that would result.)
If I want to only use a subset (say arXiv only) to train a pythia model, how do I download only those pretokenized data (including EOD tokens)?
@sujantkumarkv unfortunately, when tokenizing the Pile dataset, metadata about subsets is not retained. We don't currently have an easy way to train only on, say, the arXiv subset, and would recommend retokenizing that subset separately using GPT-NeoX's `prepare_data.py`.
Regarding how to replicate training order:
If using `EleutherAI/pile-deduped-pythia-preshuffled`: once you've downloaded and combined the shards, loading them with MMapIndexedDataset via the script at https://github.com/EleutherAI/pythia/blob/dc24af59cff8c8159a1d4b106393b39c39a1ef2e/utils/batch_viewer.py will provide a dataset where `dataset[i]` is a length-2049 sequence of tokens, containing EOD tokens separating the end of one document from the beginning of the next.
To access the `j`-th batch item at step `k` of training, you can access `dataset[(k * 1024) + j]`, which should give the context window seen at that item in training.
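The index arithmetic above can be written out as a tiny helper (the name `sample_index` is mine, not the repository's; 1024 is the Pythia batch size stated above):

```python
BATCH_SIZE = 1024  # sequences per training step for Pythia

def sample_index(step, item):
    """Flat index into the preshuffled dataset for the `item`-th
    sequence of the batch seen at training step `step`."""
    return step * BATCH_SIZE + item
```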
`EleutherAI/pythia_deduped_pile_idxmaps` contains binidx files that can be used with the script at https://github.com/EleutherAI/pythia/blob/899add0f1c71cb27dbf5a7594202584416c0b424/utils/batch_viewer.py. These binidx files contain the tokenized documents prior to chopping them into the context windows seen during training. Here, the binidx files must be loaded using `megatron.data.gpt2_dataset.GPT2Dataset` with the appropriate arguments, in order to perform shuffling via Megatron's dataset code (as was done during training) and chop the documents appropriately into context windows.
We've updated the README to clarify how to use the preshuffled binidx files!
If you're looking to reproduce the Pythia training order:

1. For viewing the training data contents, we recommend using the preshuffled binidx files together with the most up-to-date README and `batch_viewer.py`. This lets you dump the context windows seen by Pythia directly to disk and is significantly faster than the old `batch_viewer.py`.
2. If you want to re-train Pythia, we recommend doing so with the GPT-NeoX library at v1.0, taking care to use the exact same config file as we provide for the Pythia models.
I hope that this is helpful!
Thanks so much for the detailed answer!
Just to clarify for other readers, I've confirmed that `EleutherAI/pythia_deduped_pile_idxmaps` does not have any EOD tokens (but please comment if I'm wrong).
Further to @itsnamgyu's comment, I can confirm that `pile-deduped-pythia-preshuffled` does not have any EOD tokens either (I checked a 100k sample; let me know if I missed anything).
@pietrolesci Actually, according to the comment above, `pile-deduped-pythia-preshuffled` should have EOD tokens while `EleutherAI/pythia_deduped_pile_idxmaps` does not, so that is contradictory. Are you sure you are referring to `pile-deduped-pythia-preshuffled`?
If using `EleutherAI/pile-deduped-pythia-preshuffled`: once you've downloaded and combined the shards, loading them with MMapIndexedDataset via the script at https://github.com/EleutherAI/pythia/blob/dc24af59cff8c8159a1d4b106393b39c39a1ef2e/utils/batch_viewer.py will provide a dataset where `dataset[i]` is a length-2049 sequence of tokens, containing EOD tokens separating the end of one document from the beginning of the next.
Note, `batch_viewer.py` does not have any code to add EOD tokens.
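For readers who want to verify the presence or absence of EOD tokens themselves, a minimal check might look like this. The assumption that `<|endoftext|>` has id 0 comes from the GPT-NeoX-20B tokenizer used by Pythia; adjust `EOD_TOKEN_ID` if your tokenizer differs:

```python
import numpy as np

# Assumption: <|endoftext|> is token id 0 in the GPT-NeoX-20B tokenizer.
EOD_TOKEN_ID = 0

def contains_eod(sample):
    """True if the token sequence contains at least one EOD token."""
    return bool(np.any(np.asarray(sample) == EOD_TOKEN_ID))
```

Running this over a large slice of samples (as commenters in this thread did) gives a quick empirical answer.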
@haileyschoelkopf sorry for the trouble, but is `EleutherAI/pythia_deduped_pile_idxmaps` also pre-shuffled?
You mentioned in your comment:
Here, these binidx files must be loaded using megatron.data.gpt2_dataset.GPT2Dataset with the appropriate arguments, in order to perform shuffling via megatron's dataset code (as was done during training) and chop the documents appropriately into context windows.
whereas https://github.com/EleutherAI/pythia#reproducing-training says about `EleutherAI/pythia_deduped_pile_idxmaps`:
We recommend downloading this rather than retokenizing the Pile from scratch in order to guarantee preservation of the data order seen by the Pythia models
I'm training on `EleutherAI/pythia_deduped_pile_idxmaps` (while manually injecting EOD tokens), and both (1) some manual inspection and (2) the training loss suggest that it is in fact pre-shuffled.
Related to #127
@pietrolesci Actually, according to the comment above, `pile-deduped-pythia-preshuffled` should have EOD tokens while `EleutherAI/pythia_deduped_pile_idxmaps` does not, so that is contradictory. Are you sure you are referring to `pile-deduped-pythia-preshuffled`?
Hi @itsnamgyu, I confirm that -- contrary to what is expected and described in the README -- `pile-deduped-pythia-preshuffled` does NOT have an EOD token.
I also ran into the absence of EOD tokens just now 👀
will ping @haileyschoelkopf
I have checked the preshuffled dataset and found that the actual seq_length is 2050, not 2049 as described.
Hi, I am using `utils/batch_viewer.py` to iterate through Pythia's training data and calculate some batch-level statistics. Firstly, there are some gaps between the actual code in `batch_viewer.py` and what the README describes (for example, it doesn't take any config file as input, the load file name needs to be supplied separately, etc.). But these differences were obvious enough that I could fix them on my end and run the code.
However, it's the final step of saving the data after loading the buffer that I'm a bit confused about. I have two questions: