NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0
329 stars 32 forks source link

Align `extract_partitioning_index` logic with upstream shuffling #60

Closed rjzamora closed 1 month ago

rjzamora commented 1 month ago

Dask modified how partitioning_index is used for shuffling in https://github.com/dask/dask/pull/10705 (included in dask>=2023.12.1). This PR modifies extract_partitioning_index to use the same logic.

TODO:

rjzamora commented 1 month ago

cc @VibhuJawa

rjzamora commented 1 month ago

@ayushdg - thanks for the review. Still don't have an evironment to test this myself. I will try to do that later today, but if it's easy for you to test it is very welcome on my end :)

ayushdg commented 1 month ago

Can confirm I'm seeing expected results on the larger scale dataset with this fix. I'll run a few more tests on my end but it's generally looking good. @rjzamora Do you have a small example that checks consistency of the behavior of these two shuffle approaches, that could be added as a unit test with the PR?

rjzamora commented 1 month ago

Do you have a small example that checks consistency of the behavior of these two shuffle approaches, that could be added as a unit test with the PR?

@VibhuJawa - Still trying to fix my environment so I can confirm locally, but won't the FuzzyDuplicates tests fail without this fix in place?

ayushdg commented 1 month ago

Do you have a small example that checks consistency of the behavior of these two shuffle approaches, that could be added as a unit test with the PR?

@VibhuJawa - Still trying to fix my environment so I can confirm locally, but won't the FuzzyDuplicates tests fail without this fix in place?

In theory the shuffle gives incorrect results, but the dataset/num_partitions here is small enough that it doesn't impact the correctness of final results (duplicate documents detected)