Closed: eric-mitchell closed this issue 1 year ago
Ahh, I meant to follow up and amend that README discussion: the HF datasets https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated and https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile are NOT in the same shuffle order as the LFS pretokenized dataset.
However, those datasets should be in the same order as the data that was fed to GPT-NeoX's preprocess_data.py
(https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile/tree/main). One feasible option, then, would be to tokenize that data yourself and confirm via some spot checks that it does end up in the same order as the pretokenized files, but this would still require keeping two copies of the data on disk.
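A spot check along these lines could compare a freshly tokenized file against the pretokenized one without loading either fully into memory. This is only a sketch: it assumes both files are flat arrays of uint16 token IDs (as GPT-NeoX-style .bin files typically are), and the function name and paths are placeholders.

```python
import numpy as np

def same_token_prefix(path_a: str, path_b: str, n_tokens: int = 1_000_000) -> bool:
    """Compare the first n_tokens of two flat .bin token files.

    Uses np.memmap so neither file is read fully into RAM; a matching
    prefix is strong (though not conclusive) evidence the shuffle
    orders agree.
    """
    a = np.memmap(path_a, dtype=np.uint16, mode="r")
    b = np.memmap(path_b, dtype=np.uint16, mode="r")
    n = min(n_tokens, len(a), len(b))
    return bool(np.array_equal(a[:n], b[:n]))
```

Checking a few widely spaced offsets, not just the start, would give more confidence that the orders truly match.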
I can look into another way for us to distribute the data so that you don't need two copies of the .bin
file on disk to concatenate it all. Unfortunately, Hugging Face's 50GB per-file limit is what we ran up against here.
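In the meantime, one way to reduce the peak disk usage when reassembling the shards is to delete each shard right after appending it, so you never hold a full second copy. This is just an illustrative sketch, not part of any released tooling; the function name and the delete_shards option are made up here.

```python
import os
import shutil

def concat_shards(shard_paths, out_path, delete_shards=False):
    """Stream .bin shards (split to stay under the 50GB per-file limit)
    into a single output file.

    With delete_shards=True, each shard is removed immediately after it
    is appended, so the extra disk needed beyond the final file is
    roughly one shard rather than a full second copy of the dataset.
    """
    with open(out_path, "wb") as out:
        for p in shard_paths:
            with open(p, "rb") as f:
                shutil.copyfileobj(f, out)
            if delete_shards:
                os.remove(p)
```

Shard order matters: pass the paths sorted the same way the original file was split.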
To regenerate the ordering of the examples used during training, the README suggests downloading the dataset from LFS. I'm having trouble with this process because LFS downloads two copies of the data, and my hard drive can only fit one. However, according to the discussion here, the datasets on Hugging Face also seem to preserve the training order. If I just want to see the ordering of the samples, is there any reason not to use the Hugging Face data directly?
Thanks!