Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Is there any best practice for using litdata to load custom data for pretraining? #1428

Open wen020 opened 1 month ago

wen020 commented 1 month ago

Is there any best practice for using litdata to load custom data for pretraining? I found that TextFiles.py and prepare_slimpajama.py use similar data-preprocessing methods; the difference between them is when tokenization happens — one tokenizes during preprocessing, the other during training? Why is TextFiles not suitable for handling large amounts of data?

rasbt commented 1 month ago

Good point. I think the main thing here is that if you have large amounts of text, you would want to store it in a compressed or pretokenized format, and perhaps also not store it locally if you don't have much storage space on your machine. But if none of that is a concern, you could use the TextFiles option (note that it doesn't do any preprocessing for you either, like removing weird formatting characters etc.).
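To make the storage tradeoff above concrete, here is a minimal, self-contained sketch (not litgpt or litdata code) contrasting the two approaches: storing raw text and tokenizing at training time versus pretokenizing once and storing compact integer IDs. The "tokenizer" is a toy whitespace vocabulary built purely for illustration; a real pipeline would use the model's actual tokenizer.

```python
import struct

# A toy corpus: the same sentence repeated, standing in for raw text files.
corpus = ["the quick brown fox jumps over the lazy dog"] * 1000

# Build a toy word-level vocabulary (illustrative only).
vocab = {}
for line in corpus:
    for word in line.split():
        vocab.setdefault(word, len(vocab))

def tokenize(line):
    """Map each whitespace-separated word to an integer ID."""
    return [vocab[w] for w in line.split()]

# Option A (TextFiles-style): store raw UTF-8 text, tokenize during training.
raw_bytes = sum(len(line.encode("utf-8")) for line in corpus)

# Option B (slimpajama-style): pretokenize once, store 2-byte uint16 IDs.
ids = [tid for line in corpus for tid in tokenize(line)]
packed = struct.pack(f"<{len(ids)}H", *ids)

print(f"raw text:     {raw_bytes} bytes")
print(f"pretokenized: {len(packed)} bytes")
```

On this toy corpus the pretokenized form is already smaller than the raw text, and the gap grows with BPE-style tokenizers whose tokens average several characters each; the pretokenized file can also be compressed further for storage.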