Description
This PR shuffles the connected_components output by group to ensure that all documents belonging to the same duplicate group/component end up in the same partition. The motivation is to ensure that the example showcasing document removal works as intended: https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fuzzy_deduplication.py#L87.
Usage
N/A
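For illustration only, the effect of shuffling by group can be sketched in plain Python. This is a minimal sketch of the idea, not the code in this PR: `shuffle_by_group`, `docs_to_remove`, and the `doc_id` / `group` field names are hypothetical, and the actual change presumably works on distributed dataframes rather than lists of dicts.

```python
def shuffle_by_group(partitions, num_partitions, key="group"):
    """Redistribute rows so that rows sharing `key` land in the same partition."""
    shuffled = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for row in partition:
            # Every row with the same group id maps to the same target partition.
            shuffled[hash(row[key]) % num_partitions].append(row)
    return shuffled


def docs_to_remove(partition, key="group"):
    """Within one partition, keep the first document per group and flag the rest."""
    seen, remove = set(), []
    for row in partition:
        if row[key] in seen:
            remove.append(row["doc_id"])
        else:
            seen.add(row[key])
    return remove


# Hypothetical connected-components output: duplicates are spread across partitions.
partitions = [
    [{"doc_id": "a", "group": 0}, {"doc_id": "b", "group": 1}],
    [{"doc_id": "c", "group": 0}, {"doc_id": "d", "group": 2}],
    [{"doc_id": "e", "group": 1}],
]

# Without the shuffle, a per-partition pass finds no duplicates at all.
removed_before = [d for p in partitions for d in docs_to_remove(p)]

# After the shuffle, each group lives in exactly one partition.
shuffled = shuffle_by_group(partitions, num_partitions=2)
removed_after = [d for p in shuffled for d in docs_to_remove(p)]
```

In this toy input no partition holds two documents from the same group, so a per-partition removal step misses every duplicate until the rows are shuffled by group.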
Checklist