InternLM / InternEvo

Apache License 2.0
258 stars 42 forks

[Feature] a very simple hugging-face dataloader #101

Open sunpengsdu opened 5 months ago

sunpengsdu commented 5 months ago

Describe the feature

A very simple on-the-fly dataloader is needed to support most public datasets.

Will you implement it?

zigzagcai commented 1 month ago

Completed in https://github.com/InternLM/InternEvo/pull/244

  1. Load Hugging Face datasets in streaming mode, i.e. lazily load data samples with no need to download the whole dataset before training
  2. On-the-fly tokenization
  3. Support auto_resume for the hf dataloader
  4. Support packing for the hf dataloader to improve hardware utilization
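
For readers who have not looked at the PR, here is a minimal sketch of how the four points can fit together. It is not the InternEvo implementation: `PackedStream`, `toy_tokenize`, `state_dict` and `load_state_dict` are hypothetical names chosen for illustration, and the tokenizer is a stand-in. A small in-memory list takes the place of the streamed dataset so the example runs offline. In practice the samples could come from `datasets.load_dataset(..., streaming=True)`, which returns an iterable that yields dict samples lazily.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (hypothetical): one id per whitespace-separated word."""
    return [len(word) for word in text.split()]


class PackedStream:
    """Lazily tokenizes samples and packs the tokens into fixed-length sequences.

    Illustrative only; this is an assumption about one possible design,
    not the code merged in the PR.
    """

    def __init__(
        self,
        samples: Iterable[Dict[str, str]],
        tokenize: Callable[[str], List[int]],
        seq_len: int,
        text_key: str = "text",
    ) -> None:
        self.samples = samples
        self.tokenize = tokenize
        self.seq_len = seq_len
        self.text_key = text_key
        self.consumed = 0            # source samples read so far
        self.buffer: List[int] = []  # tokens not yet emitted

    def state_dict(self) -> dict:
        """Everything needed to resume: sample offset plus leftover tokens."""
        return {"consumed": self.consumed, "buffer": list(self.buffer)}

    def load_state_dict(self, state: dict) -> None:
        self.consumed = state["consumed"]
        self.buffer = list(state["buffer"])

    def __iter__(self) -> Iterator[List[int]]:
        # On resume, skip the samples that were already consumed.
        source = iter(islice(self.samples, self.consumed, None))
        while True:
            # Emit every full sequence currently available in the buffer.
            while len(self.buffer) >= self.seq_len:
                packed = self.buffer[: self.seq_len]
                self.buffer = self.buffer[self.seq_len:]
                yield packed
            sample = next(source, None)
            if sample is None:
                return  # a trailing partial sequence is dropped
            # Tokenization happens here, one sample at a time.
            self.buffer.extend(self.tokenize(sample[self.text_key]))
            self.consumed += 1


# A tiny in-memory stand-in for a streamed dataset.
samples = [
    {"text": "a bb ccc"},
    {"text": "dddd eeeee"},
    {"text": "f gg hhh iiii"},
]

# Uninterrupted run.
full_run = list(PackedStream(samples, toy_tokenize, seq_len=2))

# Interrupted run: take two sequences, checkpoint, then resume in a new object.
first = PackedStream(samples, toy_tokenize, seq_len=2)
it = iter(first)
head = [next(it), next(it)]
checkpoint = first.state_dict()

resumed = PackedStream(samples, toy_tokenize, seq_len=2)
resumed.load_state_dict(checkpoint)
tail = list(resumed)
```

In this sketch the checkpoint stores both the sample offset and the leftover token buffer, so `head + tail` reproduces `full_run` exactly. Skipping by re-iterating the source is simple but costs time proportional to the offset, and a real implementation would also have to handle sharding across data-parallel ranks and the sample boundaries inside a packed sequence.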