[FEA] Improve separate_by_metadata performance when dealing with jsonl files

When using the separate_by_metadata functionality on a corpus of jsonl files, there is no need to read all the files before separating them.

An alternative implementation that processes the files as a stream would reduce the memory needed significantly, from O(N) to O(1) in the size of the corpus, and would make out-of-memory (OOM) errors very unlikely.
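To illustrate the idea, here is a minimal sketch of a streaming separation, not the actual implementation. The function name `separate_jsonl_streaming`, its parameters, and the output layout (one subdirectory per metadata value) are all hypothetical. It reads one line at a time, so memory use does not grow with the number of documents.

```python
import json
import os
from collections import Counter


def separate_jsonl_streaming(input_paths, field, output_dir):
    """Hypothetical helper: route each jsonl record to a per-value directory.

    Only one line is held in memory at a time. Assumes every record has
    `field` and that its value is safe to use as a directory name.
    """
    counts = Counter()
    for path in input_paths:
        with open(path, encoding="utf-8") as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                value = str(json.loads(line)[field])
                target_dir = os.path.join(output_dir, value)
                os.makedirs(target_dir, exist_ok=True)
                out_path = os.path.join(target_dir, os.path.basename(path))
                # Append mode keeps the number of open handles constant,
                # at the cost of reopening the file for every record.
                with open(out_path, "a", encoding="utf-8") as dst:
                    dst.write(line if line.endswith("\n") else line + "\n")
                counts[value] += 1
    return dict(counts)
```

Because the output files are opened in append mode, running the sketch twice on the same output directory would duplicate records. A real implementation would probably cache open handles or batch the writes.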

I will also add a new feature that lets the user select which fields to keep or which fields to exclude. This is useful, for instance, after applying a quality classifier, when a user wants to keep only "High" quality documents.
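A rough sketch of how such a filter could behave, reading "fields" here as values of the metadata field (an assumption based on the "High" quality example). The helper `should_keep` and its `include` and `exclude` parameters are hypothetical names, not an existing API.

```python
def should_keep(value, include=None, exclude=None):
    """Hypothetical filter: decide whether a metadata value is written out.

    include: if given, only values in this collection are kept.
    exclude: if given, values in this collection are dropped.
    """
    if include is not None and value not in include:
        return False
    if exclude is not None and value in exclude:
        return False
    return True


# Example: keep only documents a classifier labeled "High".
docs = [
    {"text": "a", "quality": "High"},
    {"text": "b", "quality": "Low"},
    {"text": "c", "quality": "Medium"},
]
kept = [d for d in docs if should_keep(d["quality"], include={"High"})]
```

In a streaming implementation this check would run per record before anything is written, so excluded documents never touch the disk.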

I will submit a PR for these features.

NVIDIA / NeMo-Curator, issue #255