NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0

Update Fuzzy dedup params for long strings support. #77

Closed ayushdg closed 3 weeks ago

ayushdg commented 4 months ago

Fuzzy deduplication is currently accelerated via cuDF, which until release 24.04 could not hold more than the int32 maximum (2^31 - 1, about 2.1 billion) characters in a single string column. Consequently, some defaults and core logic in the deduplication pipeline exist to avoid errors in cases where that limit might be exceeded.
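To make the limit concrete, here is a minimal sketch of one way such a mitigation could look: greedily grouping documents so that no group exceeds the int32 budget. This is not NeMo-Curator's actual logic. The function name `split_by_size_budget` and the choice to measure size in UTF-8 bytes (a conservative stand-in for the character count) are assumptions made for illustration.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def split_by_size_budget(texts, budget=INT32_MAX):
    """Greedily group strings so each group's total size stays within `budget`.

    Hypothetical helper for illustration only. Size is measured in UTF-8
    bytes, which is always >= the number of characters.
    """
    batches, current, used = [], [], 0
    for text in texts:
        size = len(text.encode("utf-8"))
        if size > budget:
            # A single oversized string cannot be placed in any group.
            raise ValueError("a single string exceeds the size budget")
        if current and used + size > budget:
            # Adding this string would overflow the group, so start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += size
    if current:
        batches.append(current)
    return batches


# Small budget so the grouping is easy to follow by hand.
groups = split_by_size_budget(["aaaa", "bbb", "cc", "d"], budget=5)
print(groups)  # [['aaaa'], ['bbb', 'cc'], ['d']]
```

With the default budget, each returned group could be loaded into its own string column without crossing the int32 limit described above.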

Starting with release 24.06, cuDF has experimental support for longer strings (an int64 number of characters). This PR changes the defaults and simplifies the logic for handling long strings accordingly.