Open · cafeii opened this issue 1 week ago
Since many datasets are published in the HuggingFace datasets format, it would be convenient if `preprocess_data.py` could preprocess and tokenize HF datasets directly.
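For context, the manual workaround today is to first export the dataset to the JSONL format the script already accepts. A minimal sketch of that step (dataset name, config, and field name below are placeholders, not fixed choices):

```python
import json

from datasets import load_dataset


def hf_to_jsonl(path: str, name: str | None, text_field: str, out_path: str) -> None:
    """Stream a HF dataset and write one {"text": ...} JSON record per line."""
    ds = load_dataset(path, name, split="train", streaming=True)
    with open(out_path, "w", encoding="utf-8") as f:
        for example in ds:
            f.write(json.dumps({"text": example[text_field]}) + "\n")


if __name__ == "__main__":
    # Placeholder dataset/config/field; any dataset with a plain-text column works.
    hf_to_jsonl("wikitext", "wikitext-103-raw-v1", "text", "train.jsonl")
```

Streaming avoids downloading and materializing the full dataset before tokenization, which matters for pretraining-scale corpora.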
The overwhelming majority of HuggingFace datasets are not structured in a way that makes sense for LLM pretraining. Given that, what do you envision this looking like? Specifying a field name and only training on the text in that field?
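If the answer is yes, one possible shape is two new options, say `--hf-dataset` and `--hf-text-field` (hypothetical names, not existing flags in this repo), plus a small branch in the document iterator. A rough sketch, assuming the script currently reads JSONL with a `text` key:

```python
from datasets import load_dataset


def iter_documents(args):
    """Yield raw text documents from a HF dataset or the existing JSONL path."""
    if getattr(args, "hf_dataset", None):  # hypothetical --hf-dataset flag
        ds = load_dataset(args.hf_dataset, split="train", streaming=True)
        for example in ds:
            # Train only on the single named column; skip rows where it is empty.
            text = example.get(args.hf_text_field)  # hypothetical --hf-text-field flag
            if text:
                yield text
    else:
        import json
        with open(args.input, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)["text"]  # current JSONL behavior, simplified
```

Everything downstream (tokenization, binarization) would be unchanged; the flag pair only swaps the input source and restricts training to the one named field.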