NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0
482 stars, 58 forks

Add batched decorator #18

Closed ryantwolf closed 6 months ago

ryantwolf commented 6 months ago

Refactors batched into a decorator. This frees the user from having to know how the underlying DocumentFilter or DocumentModifier is implemented when initializing ScoreFilter or Modify, respectively.
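
For illustration, here is a minimal sketch of how a decorator-based approach like this could work. It assumes the decorator only tags the method with a `batched` attribute and that the caller inspects that attribute to decide how to invoke it. The helper `is_batched`, the function `score_all`, and both filter classes are hypothetical names made up for this example; they are not the code in this PR.

```python
from functools import wraps


def batched(function):
    """Tag a method as taking a whole batch of documents at once (sketch)."""
    @wraps(function)
    def wrapper(*args, **kwargs):
        return function(*args, **kwargs)

    # The caller checks this flag instead of asking the user for a parameter.
    wrapper.batched = True
    return wrapper


def is_batched(function):
    """Hypothetical helper: True if the function was tagged with @batched."""
    return getattr(function, "batched", False)


class WordCountFilter:
    """Hypothetical filter that scores one document per call."""

    def score_document(self, text):
        return len(text.split())


class BatchedWordCountFilter:
    """Hypothetical filter that scores a list of documents per call."""

    @batched
    def score_document(self, texts):
        return [len(text.split()) for text in texts]


def score_all(doc_filter, texts):
    """Hypothetical stand-in for the dispatch a ScoreFilter-like wrapper does."""
    if is_batched(doc_filter.score_document):
        # Batched filters receive every document in a single call.
        return list(doc_filter.score_document(texts))
    # Unbatched filters are called once per document.
    return [doc_filter.score_document(text) for text in texts]
```

With this shape, calling `score_all(WordCountFilter(), docs)` and `score_all(BatchedWordCountFilter(), docs)` looks identical from the user's side; the filter author alone decides whether the method is batched.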

Unit tests pass. The following scripts were manually tested and work properly:

examples/classifier_filtering.py
examples/find_pii_and_deidentify.py
nemo_curator/scripts/find_pii_and_deidentify.py
tutorials/tinystories/main.py
