Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

processing the dataset. #1549

Open Esmail-ibraheem opened 2 months ago

Esmail-ibraheem commented 2 months ago

I have a 96 GB Arabic dataset that I want to use for pretraining with LitGPT. However, the image provided [link to the image] mentions that if the dataset is large, we should use LitData. But when I checked the LitData README, I could not find clear instructions on how to do this.

[Image: big_data]

Here is the dataset I want to use: https://huggingface.co/datasets/ClusterlabAi/101_billion_arabic_words_dataset

Thank you.

rasbt commented 2 months ago

Good point. Does the LitData section here help? https://github.com/Lightning-AI/litdata?tab=readme-ov-file#1-prepare-your-data
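For a rough idea, the core of that section is LitData's optimize function. Below is a minimal sketch of how it could be applied to a text corpus, assuming the raw data has already been downloaded as plain-text shards; the tokenize_shard helper, the checkpoint path, and the directory names are placeholders I made up, not something LitData ships:

```python
# Minimal sketch: tokenize raw text shards into a LitData "optimized" dataset.
# Assumes the corpus is available as *.txt shards and that a LitGPT checkpoint
# (with its tokenizer files) has been downloaded under checkpoints/.
from pathlib import Path

from litdata import optimize
from litgpt.tokenizer import Tokenizer

tokenizer = Tokenizer("checkpoints/meta-llama/Llama-2-7b-hf")  # illustrative path


def tokenize_shard(filepath: str):
    # Yield one token tensor per document; LitData packs them into binary chunks.
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            text = line.strip()
            if text:
                yield tokenizer.encode(text, eos=True)


if __name__ == "__main__":
    shards = [str(p) for p in Path("raw_arabic_corpus").rglob("*.txt")]
    optimize(
        fn=tokenize_shard,                   # applied to every input item
        inputs=shards,                       # one item per raw shard
        output_dir="data/arabic_optimized",  # where the binary chunks are written
        chunk_bytes="64MB",                  # max size of each chunk on disk
        num_workers=8,
    )
```

Depending on the LitData version, optimize may also accept an item_loader argument (e.g. TokensLoader) so the chunks can later be streamed back as fixed-size token blocks during pretraining.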

Esmail-ibraheem commented 2 months ago

No, I did not understand from the LitData README how I can convert or process my custom dataset so that I can use it in LitGPT.

rasbt commented 2 months ago

Personally, I use the TextFiles approach that I've implemented in LitGPT (roughly as sketched below). But going back to your earlier comment (and the phrase in the docs): my colleagues don't recommend it for very large datasets, since it starts from plain text files rather than pre-tokenized text, and plain text can be inefficient to store.
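In case it's useful, the TextFiles flow is essentially pointing litgpt pretrain at a folder of .txt files. A rough sketch follows; the model, tokenizer, paths, and token budget are placeholders, and the exact flags may differ slightly across LitGPT versions:

```bash
# Rough sketch of the TextFiles approach: pretrain on a folder of plain .txt files.
# Model, tokenizer, paths, and token budget are placeholders; check the LitGPT
# pretraining docs for the exact flags in your installed version.
litgpt download EleutherAI/pythia-160m --tokenizer_only True

litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 10_000_000 \
  --out_dir out/custom-model
```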

I don't have much experience with LitData myself, but if I ever prepare a large custom dataset, I'll amend the docs. In the meantime, the best way is perhaps to look at how it's done in prepare_slimpajama.py and prepare_starcoder.py in https://github.com/Lightning-AI/litgpt/tree/main/litgpt/data,

which are used in the TinyLlama pretraining recipe. Thomas Chaton, the developer of LitData, also has a tutorial on the dataset prep here, which could be helpful.
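For context, once the data is in that optimized chunk format, the TinyLlama recipe streams it during pretraining roughly like this (paths, batch size, and block size are illustrative; the actual code lives in litgpt/data/tinyllama.py):

```python
# Rough sketch of how an optimized LitData directory is streamed at pretraining
# time, based on the pattern used by LitGPT's TinyLlama data module. Paths,
# block size, and batch size are illustrative placeholders.
from litdata.streaming import StreamingDataLoader, StreamingDataset, TokensLoader

train_dataset = StreamingDataset(
    input_dir="data/arabic_optimized",          # output_dir of the prepare step
    item_loader=TokensLoader(block_size=2049),  # max_seq_length + 1 for the target shift
    shuffle=True,
    drop_last=True,
)
train_dataloader = StreamingDataLoader(
    train_dataset, batch_size=4, pin_memory=True, num_workers=8, drop_last=True
)

for batch in train_dataloader:
    input_ids, targets = batch[:, :-1], batch[:, 1:]  # standard next-token shift
    break
```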