EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics
Apache License 2.0

Has the data been shuffled? #127

Open Lisennlp opened 9 months ago

Lisennlp commented 9 months ago

Hello, I looked at your batch_view.py and found that it does not shuffle the data, whereas the gpt-neox library does shuffle it. Could the authors confirm whether or not the data was shuffled during training? I hope to get your answer, thank you!

pietrolesci commented 6 months ago

I think this might provide an answer: https://github.com/EleutherAI/pythia/issues/123#issuecomment-1878882214

itsnamgyu commented 6 months ago

The data is shuffled at the document level. The dataset repo ID given in https://github.com/EleutherAI/pythia#exploring-the-dataset says "preshuffled", i.e., EleutherAI/pile-standard-pythia-preshuffled.

I'm actually not sure about https://huggingface.co/datasets/EleutherAI/pythia_deduped_pile_idxmaps, which is mentioned in https://github.com/EleutherAI/pythia#reproducing-training. I will add a question about this on #123.
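
For readers unsure what shuffling "in terms of documents" means in practice, here is a minimal sketch. It is not the actual Pythia or gpt-neox pipeline: the function name `pack_shuffled_documents`, the seed, and the sequence length are all hypothetical and chosen only for illustration. The idea is that whole documents are permuted once with a fixed seed, and only then concatenated and cut into fixed-length sequences, so tokens inside a document keep their original order.

```python
import random


def pack_shuffled_documents(documents, seq_len, seed=0):
    """Shuffle whole documents, then pack their tokens into fixed-length sequences.

    Hypothetical helper for illustration only; not part of pythia or gpt-neox.
    """
    # Permute document indices with a fixed seed so the order is reproducible.
    order = list(range(len(documents)))
    random.Random(seed).shuffle(order)

    # Concatenate documents in the shuffled order; tokens within a document stay in order.
    stream = [token for i in order for token in documents[i]]

    # Cut the stream into full sequences and drop any trailing partial sequence.
    n_full = len(stream) // seq_len
    return [stream[k * seq_len:(k + 1) * seq_len] for k in range(n_full)]


# Toy "tokenized documents" (assumed data).
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
sequences = pack_shuffled_documents(docs, seq_len=3, seed=0)
```

Because the permutation depends only on the seed, anyone re-running this with the same inputs gets the same sequences, which is the property a preshuffled dataset gives you without needing to shuffle again at load time.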