NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

Shuffle CC result on group before writing out #110

Closed ayushdg closed 2 weeks ago

ayushdg commented 2 weeks ago

Description

This PR shuffles the connected_components output by group so that all documents belonging to the same duplicate group (component) end up in the same partition. The motivation is to ensure that the example showcasing document removal works as intended: https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fuzzy_deduplication.py#L87.
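To illustrate why co-locating each group matters, here is a minimal pure-Python sketch. It is not the code in this PR. It assumes the removal step works on each partition independently, and the helper names (`shuffle_by_group`, `keep_one_per_group`), the `(doc_id, group)` row layout, and the toy data are all hypothetical. Partitions are modeled as plain lists, and integer group ids are routed with a modulo in place of a real hash.

```python
def shuffle_by_group(partitions, num_partitions):
    """Redistribute (doc_id, group) rows so each group lives in exactly one partition."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for doc_id, group in part:
            # Same group id always maps to the same output partition.
            out[group % num_partitions].append((doc_id, group))
    return out


def keep_one_per_group(partition):
    """Per-partition removal: keep the first document seen for each group."""
    seen = set()
    kept = []
    for doc_id, group in partition:
        if group not in seen:
            seen.add(group)
            kept.append(doc_id)
    return kept


# Hypothetical connected-components output: groups 0 and 1 are split across partitions.
before = [[("a", 0), ("b", 1)], [("c", 0), ("d", 1), ("e", 2)]]
after = shuffle_by_group(before, num_partitions=2)

# Without the shuffle, each partition keeps its own copy of groups 0 and 1.
naive = [d for p in before for d in keep_one_per_group(p)]
# With the shuffle, exactly one document per group survives.
fixed = [d for p in after for d in keep_one_per_group(p)]
```

In this toy case the unshuffled input keeps all five documents, while the shuffled input keeps one per group. On a Dask DataFrame, a comparable redistribution would typically be done with `DataFrame.shuffle(on=...)` on the group column; whether this PR uses that call is not stated in the description.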

Usage

N/A

Checklist