Open Andrei-Aksionov opened 10 months ago
What do you recommend for the full dataset?
For now I don't have a clean solution, unfortunately.
From what I see, the repo has RedPajama-Data-1T.py, which should help with downloading the dataset, but it's more of a config for HF's load_dataset function.
I would prefer to have a CLI command instead.
Another option is to use the code snippet from the README:
wget 'https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt'
while read line; do
dload_loc=${line#https://data.together.xyz/redpajama-data-1T/v1.0.0/}
mkdir -p $(dirname $dload_loc)
wget "$line" -O "$dload_loc"
done < urls.txt
It works, although it looks a bit clunky.
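For comparison, here is a rough Python sketch of the same loop, in case a script is easier to wrap in a CLI command later. It only mirrors what the shell snippet does (strip the common URL prefix, recreate the directory layout, download each file). The function names `local_path_for` and `download_all`, the `fetch` parameter and the destination directory are my own placeholders, not anything that exists in lit-gpt or the RedPajama repo.

```python
import urllib.request
from pathlib import Path

# Common prefix of every entry in urls.txt (taken from the README snippet).
BASE_URL = "https://data.together.xyz/redpajama-data-1T/v1.0.0/"


def local_path_for(url, dest_dir, base_url=BASE_URL):
    """Map a dataset URL to a local path that keeps the remote layout.

    Hypothetical helper: the equivalent of ${line#<prefix>} in the shell loop.
    """
    if not url.startswith(base_url):
        raise ValueError(f"unexpected URL (missing prefix): {url}")
    return Path(dest_dir) / url[len(base_url):]


def download_all(urls_file, dest_dir, fetch=urllib.request.urlretrieve):
    """Download every URL listed in urls_file into dest_dir.

    `fetch` is injectable (an assumption of this sketch) so the loop can be
    dry-run without touching the network.
    """
    targets = []
    with open(urls_file) as f:
        for line in f:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            target = local_path_for(url, dest_dir)
            # Same as `mkdir -p $(dirname ...)` in the shell version.
            target.parent.mkdir(parents=True, exist_ok=True)
            fetch(url, str(target))
            targets.append(target)
    return targets
```

After fetching urls.txt with the wget command above, something like `download_all("urls.txt", "data/RedPajama-Data-1T")` would start the download. Like the shell loop, this overwrites existing files and does not resume partial downloads.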
load_dataset should be fine; we already specify it as an optional dependency: https://github.com/Lightning-AI/lit-gpt/blob/main/requirements-all.txt#L6
Hi there 👋
The tutorial tutorials/pretrain_redpajama.md says that you can download both the full-size and the sample-size RedPajama datasets with the help of git lfs. At least as of right now, that is possible only for the sample dataset.
On the HF page for the sample dataset, you can find the list of lfs files: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample/tree/main
But not for the full-size version: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T/tree/main
For the full-size variant, only URLs are provided.