Closed rjzamora closed 1 month ago
cc @VibhuJawa
@ayushdg - thanks for the review. I still don't have an environment to test this myself. I will try to do that later today, but if it's easy for you to test, that is very welcome on my end :)
Can confirm I'm seeing expected results on the larger scale dataset with this fix. I'll run a few more tests on my end but it's generally looking good. @rjzamora Do you have a small example that checks consistency of the behavior of these two shuffle approaches, that could be added as a unit test with the PR?
> Do you have a small example that checks consistency of the behavior of these two shuffle approaches, that could be added as a unit test with the PR?

@VibhuJawa - Still trying to fix my environment so I can confirm locally, but won't the `FuzzyDuplicates` tests fail without this fix in place?
In theory the shuffle gives incorrect results, but the dataset/num_partitions here is small enough that it doesn't affect the correctness of the final results (duplicate documents detected).
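For illustration only, here is a minimal pandas-only sketch of the kind of consistency check discussed above: if two sides of a partition-wise merge compute their partition index with different logic, rows with the same key can end up in different partitions. `toy_partitioning_index`, `naive_partitioning_index`, `key_to_partition`, and `shuffles_agree` are hypothetical helpers, not Dask or project APIs, and the hashing is a stand-in that is only assumed to resemble what Dask does internally.

```python
import pandas as pd


def toy_partitioning_index(keys: pd.DataFrame, npartitions: int) -> pd.Series:
    # Hypothetical stand-in (not Dask's implementation): hash each row of the
    # key columns, ignoring the index, and fold it into [0, npartitions).
    hashed = pd.util.hash_pandas_object(keys, index=False)
    return (hashed % npartitions).astype("int64")


def naive_partitioning_index(keys: pd.DataFrame, npartitions: int) -> pd.Series:
    # A deliberately different (also hypothetical) scheme: first key column
    # modulo npartitions, with no hashing at all.
    return (keys.iloc[:, 0] % npartitions).astype("int64")


def key_to_partition(df, on, npartitions, index_fn):
    # Map every distinct key in `df` to the partition chosen by `index_fn`.
    out = df[on].drop_duplicates().reset_index(drop=True)
    out["partition"] = index_fn(out[on], npartitions).to_numpy()
    return out


def shuffles_agree(left, right, on, npartitions, left_fn, right_fn):
    # True when every key present on both sides is routed to the same
    # partition by the two index functions.
    l = key_to_partition(left, on, npartitions, left_fn)
    r = key_to_partition(right, on, npartitions, right_fn)
    both = l.merge(r, on=on, suffixes=("_left", "_right"))
    return bool((both["partition_left"] == both["partition_right"]).all())


left = pd.DataFrame({"bucket": [0, 1, 2, 3, 4, 5, 2, 3], "doc": range(8)})
right = pd.DataFrame({"bucket": [5, 4, 3, 2, 1, 0]})
on = ["bucket"]

# Same logic on both sides: a partition-wise merge would see matching keys.
same_logic = shuffles_agree(
    left, right, on, 4, toy_partitioning_index, toy_partitioning_index
)

# Different logic on each side: agreement is no longer guaranteed.
mixed_logic = shuffles_agree(
    left, right, on, 4, toy_partitioning_index, naive_partitioning_index
)
print(same_logic, mixed_logic)
```

A real unit test would compare the project's `extract_partitioning_index` output against the partitions produced by an actual Dask shuffle; the sketch only shows the shape of such a check.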
Dask modified how `partitioning_index` is used for shuffling in https://github.com/dask/dask/pull/10705 (included in `dask>=2023.12.1`). This PR modifies `extract_partitioning_index` to use the same logic.

TODO: