Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Git-LFS doesn't work for downloading `RedPajama-Data-1T` dataset #722

Open Andrei-Aksionov opened 10 months ago

Andrei-Aksionov commented 10 months ago

Hi there 👋

The tutorial tutorials/pretrain_redpajama.md says that you can download both the full-size and the sample-size RedPajama dataset with git lfs. As of right now, that works only for the sample dataset.

On the HF page for the sample dataset, you can find the list of LFS files: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample/tree/main

But not for the full-size version: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T/tree/main

For the full-size variant, only URLs are provided.

carmocca commented 10 months ago

What do you recommend for the full dataset?

Andrei-Aksionov commented 10 months ago

For now I don't have a clean solution, unfortunately.

From what I see, the repo has RedPajama-Data-1T.py, which should help with downloading the dataset, but it's more of a config for HF's load_dataset function, and I would prefer to have a CLI command.

Another option is to use the code snippet from the README:

wget 'https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt'
while IFS= read -r line; do
    # Strip the common URL prefix to get the relative download location.
    dload_loc=${line#https://data.together.xyz/redpajama-data-1T/v1.0.0/}
    mkdir -p "$(dirname "$dload_loc")"
    wget "$line" -O "$dload_loc"
done < urls.txt

It works, although it looks a bit clunky.
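For anyone who prefers Python over the shell loop, here is a rough sketch of the same idea. It only assumes the URL prefix from the snippet above; the function names `relative_path` and `download_all`, the `.part` temporary suffix, and the skip-if-present behaviour are my own additions for illustration and are not part of the README or of lit-gpt.

```python
import urllib.request
from pathlib import Path

# Common prefix taken from the README snippet quoted above.
BASE_URL = "https://data.together.xyz/redpajama-data-1T/v1.0.0/"


def relative_path(url: str, base_url: str = BASE_URL) -> Path:
    """Strip the common prefix, mirroring ${line#...} in the shell loop."""
    url = url.strip()
    if not url.startswith(base_url):
        raise ValueError(f"unexpected URL: {url}")
    return Path(url[len(base_url):])


def download_all(urls_file: str, out_dir: str = ".") -> None:
    """Download every URL listed in urls_file into out_dir (hypothetical helper)."""
    for line in Path(urls_file).read_text().splitlines():
        url = line.strip()
        if not url:
            continue
        target = Path(out_dir) / relative_path(url)
        if target.exists():
            continue  # assumption: an existing file is a finished download
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first so an interrupted run leaves no
        # half-written file under the final name.
        tmp = target.with_name(target.name + ".part")
        urllib.request.urlretrieve(url, tmp)
        tmp.replace(target)
```

After fetching urls.txt, calling `download_all("urls.txt", "data")` should recreate the same directory layout under `data/`. Rerunning it after an interruption would only fetch the files that are still missing.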

carmocca commented 10 months ago

load_dataset should be fine; we already specify it as an optional dependency: https://github.com/Lightning-AI/lit-gpt/blob/main/requirements-all.txt#L6
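A minimal sketch of what the load_dataset route could look like, assuming the Hugging Face `datasets` package is installed. The helper names `redpajama_repo` and `load_redpajama` are hypothetical. Since the full-size repo relies on a loading script (RedPajama-Data-1T.py), some `datasets` versions may require an extra opt-in for script-based datasets or may not support them at all, so treat this as a starting point rather than a verified recipe.

```python
def redpajama_repo(sample: bool = False) -> str:
    """Return the HF repo ID for the full-size or the sample dataset."""
    base = "togethercomputer/RedPajama-Data-1T"
    return f"{base}-Sample" if sample else base


def load_redpajama(sample: bool = False, streaming: bool = True):
    """Load RedPajama through load_dataset (hypothetical wrapper)."""
    # Imported lazily because `datasets` is only an optional dependency.
    from datasets import load_dataset

    # streaming=True iterates over the data lazily instead of downloading
    # the whole dataset to disk first.
    return load_dataset(redpajama_repo(sample), streaming=streaming)
```

Trying `load_redpajama(sample=True)` first is a cheap way to check the setup before pointing it at the full-size dataset.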