When using the `separate_by_metadata` functionality on a corpus of JSONL files, there is no need to read all the files into memory before separating them.
An alternative implementation that streams records one at a time would cut memory use from O(N) in the corpus size to roughly O(1), which makes out-of-memory (OOM) failures very unlikely.
I will also add a new feature that lets the user select which values of the metadata field to keep, or which to exclude. This is useful, for instance, after applying a quality classifier, when a user wants to keep only "High" quality documents.
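
To make the proposal concrete, here is a minimal sketch of the streaming idea, not the actual implementation. The function name `separate_streaming`, its `include`/`exclude` parameters, and the `part.jsonl` output layout are all hypothetical. It assumes every metadata value is a valid directory name and that the number of distinct values is small enough to keep one open file handle per value.

```python
import json
import os
from contextlib import ExitStack


def separate_streaming(input_paths, output_dir, field, include=None, exclude=None):
    """Hypothetical helper: write each record to <output_dir>/<value>/part.jsonl.

    Records are read and written one line at a time, so memory use does not
    grow with corpus size (only with the number of distinct values).
    Records missing `field` are grouped under the string "None".
    """
    counts = {}
    handles = {}
    with ExitStack() as stack:  # closes every output file on exit
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    if not line.strip():
                        continue  # skip blank lines
                    value = str(json.loads(line).get(field))
                    if include is not None and value not in include:
                        continue
                    if exclude is not None and value in exclude:
                        continue
                    if value not in handles:
                        subdir = os.path.join(output_dir, value)
                        os.makedirs(subdir, exist_ok=True)
                        out_path = os.path.join(subdir, "part.jsonl")
                        handles[value] = stack.enter_context(
                            open(out_path, "w", encoding="utf-8")
                        )
                    # Preserve the original line, ensuring a trailing newline.
                    handles[value].write(line if line.endswith("\n") else line + "\n")
                    counts[value] = counts.get(value, 0) + 1
    return counts
```

For the quality-classifier case, a call such as `separate_streaming(paths, "out", "quality", include={"High"})` would write only the "High" documents and skip everything else without ever holding the corpus in memory.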
I will submit a PR for these features.