EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics
Apache License 2.0

Reading data is slow! #126

Open Lisennlp opened 9 months ago

Lisennlp commented 9 months ago

I followed the README:

  git lfs clone https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps
  python utils/unshard_memmap.py --input_file ./pythia_deduped_pile_idxmaps/pile_0.87_deduped_text_document-00000-of-00082.bin --num_shards 83 --output_dir ./pythia_pile_idxmaps/

This produced a 600+G file. I then used gpt-neox's dataloader to read the data, and it was very slow: it takes about 6s to read one 2048-length piece of data. May I ask why?
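To separate raw disk latency from dataloader overhead, it can help to time a direct memory-mapped read of one 2048-token window. The following is only a minimal sketch using the Python standard library. It assumes the `.bin` file is a flat array of little-endian 16-bit token ids, which the thread does not confirm; `read_window`, `time_read`, and the demo file are hypothetical names introduced for illustration, not part of gpt-neox or pythia.

```python
import mmap
import os
import struct
import tempfile
import time

SEQ_LEN = 2048   # sample length quoted in the comment above
TOKEN_BYTES = 2  # ASSUMPTION: one token is one little-endian uint16


def read_window(path, start, length=SEQ_LEN):
    """Read `length` tokens starting at token index `start` through a memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing the map copies only the requested bytes from the file.
            raw = mm[start * TOKEN_BYTES:(start + length) * TOKEN_BYTES]
    return list(struct.unpack(f"<{len(raw) // TOKEN_BYTES}H", raw))


def time_read(path, start, length=SEQ_LEN):
    """Return the window and the wall-clock seconds the read took."""
    t0 = time.perf_counter()
    window = read_window(path, start, length)
    return window, time.perf_counter() - t0


# Demo on a small synthetic file (hypothetical stand-in for the real .bin).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
n_tokens = 10 * SEQ_LEN
with open(demo_path, "wb") as f:
    f.write(struct.pack(f"<{n_tokens}H", *range(n_tokens)))

window, seconds = time_read(demo_path, start=3 * SEQ_LEN)
print(f"read {len(window)} tokens in {seconds * 1e3:.3f} ms")
```

If a read like this is fast on the real file while the dataloader still takes seconds per sample, the delay probably lies outside the raw read, for example in index loading or in the storage backend.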


liu09114 commented 6 months ago

I get a file of only 386G: "386G Jan 30 13:28 pile_0.87_deduped_text_document.bin". I also didn't get the '*.idx' file. Should we use the downloaded idx file directly?
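One way to check whether a merged file is complete is to compare its size with the combined size of the downloaded shards. The sketch below assumes that unsharding is a plain concatenation, so the two sizes should match; that is an assumption, not something stated in this thread. The shard naming follows the `-00000-of-00082.bin` pattern from the command above, and `shard_paths` and `check_unsharded` are hypothetical helper names.

```python
import os
import tempfile


def shard_paths(prefix, num_shards):
    """Build shard file names such as <prefix>-00000-of-00082.bin for 83 shards."""
    last = num_shards - 1
    return [f"{prefix}-{i:05d}-of-{last:05d}.bin" for i in range(num_shards)]


def check_unsharded(prefix, num_shards, merged_path):
    """Compare the merged file size with the sum of the shard sizes.

    ASSUMPTION: the merged file is a byte-for-byte concatenation of the shards.
    """
    paths = shard_paths(prefix, num_shards)
    missing = [p for p in paths if not os.path.exists(p)]
    expected = sum(os.path.getsize(p) for p in paths if os.path.exists(p))
    actual = os.path.getsize(merged_path)
    return {
        "missing": missing,
        "expected_bytes": expected,
        "actual_bytes": actual,
        "ok": not missing and expected == actual,
    }


# Demo with three tiny synthetic shards (hypothetical data, not the real dataset).
demo_dir = tempfile.mkdtemp()
demo_prefix = os.path.join(demo_dir, "demo_text_document")
for path, size in zip(shard_paths(demo_prefix, 3), (4, 5, 6)):
    with open(path, "wb") as f:
        f.write(b"\0" * size)

merged = os.path.join(demo_dir, "demo_text_document.bin")
with open(merged, "wb") as f:
    f.write(b"\0" * 15)

report = check_unsharded(demo_prefix, 3, merged)
print(report["ok"], report["expected_bytes"], report["actual_bytes"])
```

A non-empty `missing` list or a size mismatch would suggest that some shards were not fully downloaded or that the merge stopped early.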