NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0
482 stars 58 forks

Add dataset blending tool #32

Closed: ryantwolf closed this 5 months ago

ryantwolf commented 5 months ago

Add functionality for large scale dataset blending and shuffling.
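For readers unfamiliar with the term, blending here means drawing records from several datasets in proportion to chosen weights and then shuffling the result. The snippet below is only a minimal in-memory sketch of that idea, not the implementation in this PR; the function name `blend_datasets` and its parameters are hypothetical, and a large-scale version would operate on distributed partitions rather than Python lists.

```python
import random


def blend_datasets(datasets, weights, target_size, seed=0):
    """Hypothetical helper: sample from each dataset by weight, then shuffle.

    datasets    -- list of non-empty lists of records
    weights     -- one non-negative number per dataset (need not sum to 1)
    target_size -- approximate number of records in the blend
    """
    if len(datasets) != len(weights):
        raise ValueError("datasets and weights must have the same length")

    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    rng = random.Random(seed)
    blended = []
    for records, weight in zip(datasets, weights):
        # Per-dataset share of the target; rounding means the final size
        # can differ slightly from target_size.
        count = round(target_size * weight / total_weight)
        # Sample with replacement so a small dataset can be upsampled.
        blended.extend(rng.choices(records, k=count))

    # Shuffle so records from different sources are interleaved.
    rng.shuffle(blended)
    return blended
```

For example, `blend_datasets([web_docs, code_docs], [3, 1], 1000)` would draw roughly 750 records from `web_docs` and 250 from `code_docs` (both names are placeholders).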

ryantwolf commented 5 months ago

@ayushdg could you please take a look at why the exact deduplication tests are failing here? I didn't modify anything related to dedup, but evidently one of my changes triggered this.

ayushdg commented 5 months ago

I haven't been able to reproduce this locally. At first glance, it looks like the Dask config option for the shuffle method is no longer "tasks" by the time we reach the assert_eq. Overriding it in the test would allow things to pass, but I'm curious why it isn't reproducing locally.
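A rough sketch of what such an override could look like, using `dask.config.set` as a context manager so the setting is restored afterwards. The config key is an assumption: recent Dask releases use `dataframe.shuffle.method`, while older ones used a different key, so it should be checked against the installed version. The helper name `run_with_task_shuffle` is hypothetical and not part of the test suite.

```python
import dask

# Assumption: this is the key the installed Dask version reads for the
# DataFrame shuffle method; older releases used a different key.
SHUFFLE_KEY = "dataframe.shuffle.method"


def run_with_task_shuffle(fn, *args, **kwargs):
    """Hypothetical helper: call fn with the shuffle method pinned to "tasks".

    The previous config value is restored when the call returns.
    """
    with dask.config.set({SHUFFLE_KEY: "tasks"}):
        return fn(*args, **kwargs)
```

In a test, the comparison itself would go inside the wrapped callable, so that the `assert_eq` runs while the override is active. This only works around the symptom; it does not explain which change altered the config in the first place.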