NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0
329 stars 32 forks

Fuzzy Dedup: Use text_field instead of hardcoded text column #74

Closed ayushdg closed 1 month ago

ayushdg commented 1 month ago

One of the functions in the Jaccard computation for fuzzy dedup assumes that the dataset's text field is called text and ignores the text_field value provided by the user.
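To illustrate the kind of bug described above, here is a minimal sketch in plain Python. It is not the NeMo-Curator implementation: the helper names (`char_ngrams`, `jaccard_similarity`), the character n-gram width, and the sample records are all hypothetical. It only shows why reading the column name from a `text_field` parameter matters when the dataset's text column is not named `text`.

```python
def char_ngrams(text, width=5):
    """Return the set of character n-grams of the given width."""
    if len(text) <= width:
        return {text}
    return {text[i:i + width] for i in range(len(text) - width + 1)}


def jaccard_similarity(doc_a, doc_b, text_field="text", width=5):
    """Jaccard similarity of two records, reading text from `text_field`."""
    # Use the caller-supplied column name instead of a hardcoded doc["text"].
    grams_a = char_ngrams(doc_a[text_field], width)
    grams_b = char_ngrams(doc_b[text_field], width)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


# Hypothetical records whose text lives in a "content" column, not "text".
docs = [
    {"id": 0, "content": "the quick brown fox"},
    {"id": 1, "content": "the quick brown fox"},
]

print(jaccard_similarity(docs[0], docs[1], text_field="content"))
```

With a hardcoded `doc["text"]` lookup, the same call would raise a `KeyError` on these records, which is roughly the failure mode the PR addresses.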

Additionally, this PR explicitly sets query planning to False.
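As a rough sketch of what such a setting can look like, assuming the query planning mentioned here refers to Dask's `dataframe.query-planning` configuration option (the PR text does not name the library, and the exact location of the setting in the codebase is not shown):

```python
import dask

# Assumption: the setting in question is Dask's "dataframe.query-planning".
# It should be set before dask.dataframe is first imported, because the
# choice of backend is made at import time.
dask.config.set({"dataframe.query-planning": False})

print(dask.config.get("dataframe.query-planning"))
```

Setting the value explicitly avoids depending on the installed Dask version's default. Newer Dask releases may no longer support the legacy (non-query-planning) path, so this applies only to versions that still ship it.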

ayushdg commented 1 month ago

Good catch. Thanks!

Whoops, just realized I was setting query-planning to True in the PR. Just updated.