Open wen020 opened 1 month ago
Good point. I think the main consideration is that if you have a large amount of text, you would want to store it in a compressed or pretokenized format, and perhaps also avoid storing it locally if your machine doesn't have much storage space. But if none of that is a concern, you could use the TextFiles
option (note that it doesn't do any preprocessing for you either, such as removing stray formatting characters).
Is there a best practice for using litdata to load custom data for pretraining? I found that TextFiles.py and prepare_slimpajama.py have similar preprocessing steps; the difference seems to be when tokenization happens: one tokenizes during preprocessing, the other during training. Why is TextFiles not suitable for handling large amounts of data?
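To make the trade-off concrete, here is a toy sketch of the two strategies being contrasted. This is not litdata's actual API, and `toy_tokenize` is a hypothetical stand-in for a real tokenizer; it only illustrates the difference between tokenizing once during preprocessing (writing ids to a compact binary file) and tokenizing on the fly during training:

```python
import array
import tempfile
import zlib
from pathlib import Path

VOCAB_SIZE = 50257  # illustrative vocab size, not tied to any real tokenizer

def toy_tokenize(text):
    """Stand-in tokenizer: stable word -> id mapping via crc32 (illustration only)."""
    return [zlib.crc32(w.encode()) % VOCAB_SIZE for w in text.split()]

# Strategy 1: pretokenize once during preprocessing.
# Training then only reads integer ids from disk; the tokenizer is
# out of the hot loop, and the stored ids are far more compact than raw text.
def pretokenize_to_disk(texts, out_path):
    ids = array.array("I")
    for doc in texts:
        ids.extend(toy_tokenize(doc))
    with open(out_path, "wb") as f:
        ids.tofile(f)

def load_pretokenized(path):
    ids = array.array("I")
    ids.frombytes(Path(path).read_bytes())
    return list(ids)

# Strategy 2: tokenize lazily during training.
# Simpler, keeps the raw text around, but pays the tokenization CPU cost
# on every pass over the data.
def stream_tokens(texts):
    for doc in texts:
        yield from toy_tokenize(doc)

corpus = ["the quick brown fox", "jumps over the lazy dog"]
with tempfile.TemporaryDirectory() as d:
    bin_path = Path(d) / "corpus.bin"
    pretokenize_to_disk(corpus, bin_path)
    offline = load_pretokenized(bin_path)
online = list(stream_tokens(corpus))
assert offline == online  # same token stream, different cost profile
```

Both paths yield identical token streams; the choice is about where the tokenization cost is paid and how much storage the prepared dataset needs, which is why pretokenization tends to win for very large corpora.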