NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

[FEA] Add batched files reading to separate_by_metadata.py #53

Open miguelusque opened 1 month ago

miguelusque commented 1 month ago

Is your feature request related to a problem? Please describe.

The separate_by_metadata.py script reads all the input files at once and distributes them across the Dask workers. With large inputs, that can lead to out-of-memory (OOM) errors.

Describe the solution you'd like

Read the files in batches, so that only one batch is in memory at a time and the chance of an OOM is reduced.
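
To make the request concrete, here is a minimal sketch of the batching idea in plain Python. It is not the current script's code and does not use the NeMo Curator or Dask APIs; the function names (`iter_batches`, `separate_batch`, `separate_by_metadata_batched`), the `batch_size` parameter, and the JSONL input format are all assumptions made for illustration. It also assumes the metadata values are safe to use as file names.

```python
import json
from collections import defaultdict
from itertools import islice
from pathlib import Path


def iter_batches(paths, batch_size):
    """Yield successive lists of at most `batch_size` paths (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(paths)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def separate_batch(batch, field, output_dir):
    """Append every JSONL record in one batch to a per-value output file.

    Returns the number of records written per metadata value.
    """
    counts = defaultdict(int)
    for path in batch:
        with open(path, encoding="utf-8") as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                # Records without the field go to "unknown" (an assumption).
                value = str(record.get(field, "unknown"))
                out_path = Path(output_dir) / f"{value}.jsonl"
                with open(out_path, "a", encoding="utf-8") as dst:
                    dst.write(json.dumps(record) + "\n")
                counts[value] += 1
    return counts


def separate_by_metadata_batched(paths, field, output_dir, batch_size=64):
    """Process the input files one batch at a time instead of all at once."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    totals = defaultdict(int)
    for batch in iter_batches(paths, batch_size):
        # Only the files of the current batch are open or in flight here.
        for value, n in separate_batch(batch, field, output_dir).items():
            totals[value] += n
    return dict(totals)
```

In the real script, each batch would presumably be handed to the Dask workers in place of the `separate_batch` call, so peak memory scales with `batch_size` and not with the total number of input files.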

miguelusque commented 1 month ago

I will work on this one.