Closed ZhibinDuan closed 11 months ago
Hi Zhibin. I ran into the same issue: EleutherAI has taken the Pile dataset offline. However, a copy is still available on Hugging Face Datasets. Please check: https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile/tree/main
I have downloaded the dataset from https://huggingface.co/datasets/EleutherAI/raw_deduplicated_pile/tree/main. It consists of 19 files named pile_train_deduped0-19.jsonl.
However, this is the raw dataset, and I can't use your data processing code to sample a training dataset from it. Could you help me further?
Thank you very much.
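In case it helps others who hit the same problem, here is a minimal sketch of one way to draw a random subset of documents from the raw JSONL shards. It is not the authors' data processing code. It assumes each line is a JSON object with the document under a `"text"` key (please verify this against the downloaded files), and the names `iter_documents`, `reservoir_sample`, and `write_sample` are hypothetical helpers made up for this example. It samples by document count, not by token count, so reaching a token budget such as 26B would still require a tokenizer pass.

```python
import json
import random
from pathlib import Path
from typing import Iterable, Iterator, List


def iter_documents(paths: Iterable[Path], text_key: str = "text") -> Iterator[str]:
    """Stream document texts from JSONL shards one line at a time.

    ASSUMPTION: every non-empty line is a JSON object and the document
    body is stored under `text_key`. Check this on the real files.
    """
    for path in paths:
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                if isinstance(record, dict) and record.get(text_key):
                    yield record[text_key]


def reservoir_sample(items: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Uniformly sample up to k items from a stream (reservoir sampling).

    Only k items are held in memory, so the shards never need to be
    loaded in full.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for index, item in enumerate(items):
        if index < k:
            reservoir.append(item)
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = item
    return reservoir


def write_sample(paths: Iterable[Path], output_path: Path, k: int, seed: int = 0) -> int:
    """Write a sampled subset as JSONL and return the number of documents written."""
    sample = reservoir_sample(iter_documents(paths), k, seed)
    with open(output_path, "w", encoding="utf-8") as out:
        for text in sample:
            out.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return len(sample)
```

For example, calling `write_sample(sorted(Path("pile").glob("*.jsonl")), Path("sample.jsonl"), k=100000)` would write 100,000 randomly chosen documents, where the `pile` directory and the output filename are placeholders.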
In your paper, the training dataset for memory adaptive learning contains 26B tokens, but I can't find a description of how to build this training dataset.
Besides, the dataset can no longer be downloaded from its URL: https://the-eye.eu/public/AI/pile/. Could you help me solve these problems? Thank you.