NVIDIA / NeMo-Curator

Scalable data pre processing and curation toolkit for LLMs
Apache License 2.0

[FEA] Improve separate_by_metadata performance when dealing with jsonl files #255

Closed miguelusque closed 1 month ago

miguelusque commented 1 month ago

When using the separate_by_metadata functionality on a corpus of jsonl files, there is no need to read all the files into memory before separating them.

With an alternative implementation that processes the files as they are read, the memory needed would drop significantly, from O(N) in the size of the corpus to roughly O(1), and the chance of an OOM error would fall to almost zero.
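As a rough illustration of the idea (not the actual NeMo Curator implementation), the sketch below streams each jsonl file line by line and appends every record to an output file chosen by its metadata value. The function name `separate_jsonl_streaming`, the `part.jsonl` output name, and the `"unknown"` fallback are all hypothetical, and it assumes the metadata values are safe to use as directory names. Only one line is held in memory at a time, plus one open file handle per distinct metadata value.

```python
import json
from collections import Counter
from pathlib import Path


def separate_jsonl_streaming(input_dir, output_dir, field):
    """Hypothetical helper: split jsonl records by `field` without loading the corpus."""
    counts = Counter()
    handles = {}  # one open output file per distinct metadata value
    try:
        for path in sorted(Path(input_dir).glob("*.jsonl")):
            with open(path, encoding="utf-8") as src:
                for line in src:  # only the current line is kept in memory
                    if not line.strip():
                        continue
                    # Assumption: records missing the field go to "unknown".
                    value = str(json.loads(line).get(field, "unknown"))
                    if value not in handles:
                        target = Path(output_dir) / value
                        target.mkdir(parents=True, exist_ok=True)
                        handles[value] = open(
                            target / "part.jsonl", "w", encoding="utf-8"
                        )
                    out_line = line if line.endswith("\n") else line + "\n"
                    handles[value].write(out_line)
                    counts[value] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return dict(counts)
```

Memory use here grows with the number of distinct metadata values rather than with the number of documents, which is what makes the "almost O(1)" claim plausible for fields with few categories.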

I will also add a new feature that lets users select which field values to keep or which to exclude. This is useful, for instance, after applying a quality classifier, when a user wants to keep only "High" quality documents.
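One possible shape for such a filter, again only a sketch: the function `filter_jsonl_lines`, its `include_values` and `exclude_values` parameters, and the `quality_pred` field name used in the usage example are assumptions made for illustration, not names taken from NeMo Curator.

```python
import json


def filter_jsonl_lines(lines, field, include_values=None, exclude_values=None):
    """Hypothetical helper: lazily yield jsonl lines whose `field` value passes the filters."""
    for line in lines:
        if not line.strip():
            continue
        value = json.loads(line).get(field)
        # Keep only listed values when an include list is given.
        if include_values is not None and value not in include_values:
            continue
        # Drop listed values when an exclude list is given.
        if exclude_values is not None and value in exclude_values:
            continue
        yield line


# Usage: keep only documents that a classifier labelled "High".
sample = [
    '{"text": "a", "quality_pred": "High"}',
    '{"text": "b", "quality_pred": "Low"}',
    '{"text": "c", "quality_pred": "Medium"}',
]
high_only = list(filter_jsonl_lines(sample, "quality_pred", include_values={"High"}))
```

Because the function is a generator, it can be combined with line-by-line reading so that filtering does not add to peak memory.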

I will submit a PR for these features.