NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0
478 stars 57 forks

Improve speed of AddId module #36

Closed: ryantwolf closed this 5 months ago

ryantwolf commented 5 months ago

Improves the speed of the AddId module by reducing the size of the task graph. However, the new method uses a different ID structure. Users who need the old format, which is guaranteed to be ordered, can still get it by passing an explicit start_id.
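
To illustrate the trade-off, here is a minimal pure-Python sketch of the two ID schemes. It is not the code from this PR: the function names, the `doc_id` prefix, and the zero-padded ID layout are all assumptions chosen for illustration. The point it tries to show is that globally ordered IDs need each partition's offset, which depends on the sizes of every earlier partition, while partition-local IDs need only the partition's own index and so can be computed independently.

```python
from itertools import accumulate


def ordered_ids(partition_sizes, start_id=0, prefix="doc_id"):
    """Hypothetical ordered scheme: IDs increase globally from start_id.

    Each partition's offset is the running total of all earlier partition
    sizes, so no partition can be labeled until those sizes are known.
    """
    # accumulate(..., initial=start_id) yields one value more than there are
    # partitions; the last value is the total, which is not an offset.
    offsets = list(accumulate(partition_sizes, initial=start_id))[:-1]
    return [
        [f"{prefix}-{offset + i:010d}" for i in range(size)]
        for offset, size in zip(offsets, partition_sizes)
    ]


def partition_local_ids(partition_sizes, prefix="doc_id"):
    """Hypothetical fast scheme: ID = local row index + partition index.

    Each partition uses only its own index, so there is no dependency
    between partitions. IDs are unique but not globally ordered.
    """
    return [
        [f"{prefix}-{i:010d}{part:010d}" for i in range(size)]
        for part, size in enumerate(partition_sizes)
    ]


if __name__ == "__main__":
    sizes = [2, 3]  # two partitions holding 2 and 3 documents
    print(ordered_ids(sizes, start_id=5))
    print(partition_local_ids(sizes))
```

In a distributed setting, the running total in the ordered scheme is the kind of cross-partition dependency that can enlarge a task graph, which is presumably why the ordered format is kept as an opt-in path rather than the default.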