Open · cafeii opened this issue 1 week ago
Since many datasets are published in the HuggingFace datasets format, it would be convenient if `preprocess_data.py` could preprocess and tokenize HF datasets directly.
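For context, the manual workaround today is to first export the dataset to the JSONL format the script already accepts. A minimal sketch of that step (dataset name, config, and field name below are placeholders, not fixed choices):

```python
import json

from datasets import load_dataset


def hf_to_jsonl(path: str, name: str | None, text_field: str, out_path: str) -> None:
    """Stream a HF dataset and write one {"text": ...} JSON record per line."""
    ds = load_dataset(path, name, split="train", streaming=True)
    with open(out_path, "w", encoding="utf-8") as f:
        for example in ds:
            f.write(json.dumps({"text": example[text_field]}) + "\n")


if __name__ == "__main__":
    # Placeholder dataset/config/field; any dataset with a plain-text column works.
    hf_to_jsonl("wikitext", "wikitext-103-raw-v1", "text", "train.jsonl")
```

Streaming avoids downloading and materializing the full dataset before tokenization, which matters for pretraining-scale corpora.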
The overwhelming majority of HuggingFace datasets are not structured in a way that makes sense for LLM pretraining. Given that, what do you envision this looking like? Specifying a field name and only training on the text in that field?
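If the answer is yes, one possible shape is two new options, say `--hf-dataset` and `--hf-text-field` (hypothetical names, not existing flags in this repo), plus a small branch in the document iterator. A rough sketch, assuming the script currently reads JSONL with a `text` key:

```python
from datasets import load_dataset


def iter_documents(args):
    """Yield raw text documents from a HF dataset or the existing JSONL path."""
    if getattr(args, "hf_dataset", None):  # hypothetical --hf-dataset flag
        ds = load_dataset(args.hf_dataset, split="train", streaming=True)
        for example in ds:
            # Train only on the single named column; skip rows where it is empty.
            text = example.get(args.hf_text_field)  # hypothetical --hf-text-field flag
            if text:
                yield text
    else:
        import json
        with open(args.input, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)["text"]  # current JSONL behavior, simplified
```

Everything downstream (tokenization, binarization) would be unchanged; the flag pair only swaps the input source and restricts training to the one named field.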