Esmail-ibraheem opened 2 months ago

I have an Arabic dataset of size 96 GB that I want to use for pretraining with LitGPT. However, in the image provided [link to the image], it is mentioned that if the dataset is large, we should use LitData. But when I checked the LitData README, there were no clear instructions on how to do this.

Here is the dataset I want to use: https://huggingface.co/datasets/ClusterlabAi/101_billion_arabic_words_dataset

Thank you.
Good point. Does the LitData section here help? https://github.com/Lightning-AI/litdata?tab=readme-ov-file#1-prepare-your-data
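For context, the pattern that section shows is a one-time `optimize()` pass that writes samples into binary chunks, which a `StreamingDataset` can then read back. A minimal, self-contained sketch (the sample function and directory names are placeholders, not from this thread):

```python
import numpy as np
from litdata import optimize, StreamingDataset


def make_sample(index):
    # Any picklable dict works; litdata serializes the values into chunk files.
    return {"index": index, "tokens": np.random.randint(0, 100, size=16)}


if __name__ == "__main__":
    # One-time conversion: 1000 samples -> chunked binary files on disk.
    optimize(
        fn=make_sample,
        inputs=list(range(1000)),
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB",  # target size of each chunk file
    )

    # The optimized directory can then be streamed during training.
    dataset = StreamingDataset("my_optimized_dataset")
    print(len(dataset), dataset[0]["index"])
```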
No, I did not understand from the LitData README how I can convert or process my custom dataset so that I can use it in LitGPT.
Personally, I use the TextFiles approach that I've implemented in LitGPT. But going back to an earlier comment you had (and the phrasing in the docs): my colleagues don't recommend it for very large datasets, since it starts from plain text files rather than tokenized text, and plain text is inefficient to store. For reference, a minimal invocation looks like the sketch below.
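Something along these lines should work for the TextFiles route; the model name and paths are placeholders, so double-check the flags against `litgpt pretrain --help`:

```bash
# 1) Download only the tokenizer for a base model (placeholder model name)
litgpt download EleutherAI/pythia-160m --tokenizer_only True

# 2) Pretrain on a directory of plain-text files
litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --out_dir out/custom-model
```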
Personally, I don't have much experience with LitData, but if I ever prepare a large custom dataset, I'll amend the docs. In the meantime, the best way is probably to look at how it's done in prepare_slimpajama.py and prepare_starcoder.py in https://github.com/Lightning-AI/litgpt/tree/main/litgpt/data, which are used in the Pretrain TinyLlama tutorial; a rough sketch of that pattern is below. Thomas Chaton, the developer of LitData, also has a tutorial on the dataset prep here, which could be helpful.
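To sketch what those scripts do, adapted to a generic text dataset: tokenize each document and let litdata pack the token streams into chunks. This is an untested outline; the local parquet copy of the dataset, the `text` column name, and the tokenizer are all assumptions to adjust:

```python
# Untested sketch: assumes the dataset was downloaded locally as parquet files
# with the raw text in a "text" column; tokenizer and paths are placeholders.
from functools import partial
from pathlib import Path

import numpy as np
import pyarrow.parquet as pq
from litdata import optimize
from transformers import AutoTokenizer


def tokenize_fn(filepath, tokenizer=None):
    # Yield one token array per document; litdata packs them into chunk files.
    parquet_file = pq.ParquetFile(filepath)
    for batch in parquet_file.iter_batches(columns=["text"]):
        for text in batch.to_pandas()["text"]:
            yield np.asarray(tokenizer.encode(text), dtype=np.int32)


if __name__ == "__main__":
    input_dir = "data/arabic_raw"  # hypothetical local copy of the HF dataset
    inputs = [str(f) for f in Path(input_dir).rglob("*.parquet")]

    optimize(
        fn=partial(
            tokenize_fn,
            tokenizer=AutoTokenizer.from_pretrained("EleutherAI/pythia-160m"),
        ),
        inputs=inputs,
        output_dir="data/arabic_optimized",
        chunk_size=(2049 * 8012),  # tokens per chunk (~64 MB), as in the TinyLlama scripts
    )
```

The resulting output directory should then be readable with litdata's StreamingDataset, or via the corresponding LitData data module in LitGPT (check litgpt/data for the exact name and arguments).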