NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

[BUG] Fuzzy deduplication fails on datasets with no duplicates #67

Open Maghoumi opened 1 month ago

Maghoumi commented 1 month ago

Describe the bug

When attempting to run fuzzy deduplication on a dataset that has no duplicates, the code errors out.

Steps/Code to reproduce bug

1) Clone the repo
2) Run the TinyStories tutorial
3) Run examples/fuzzy_deduplication.py on the dataset under tutorials/tinystories/data/jsonl/val/

Expected behavior

The code should not crash when the dataset contains no duplicates.
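
To make the expected behavior concrete, here is a minimal sketch in plain Python. It is not NeMo Curator's code, and the report does not say where the failure occurs. It assumes the pipeline ends up with an empty collection of duplicate groups and shows a removal step that treats this as a valid result. The function name `drop_duplicates`, the `id` field, and the group layout are all hypothetical.

```python
from typing import Dict, Hashable, List


def drop_duplicates(
    docs: List[Dict[str, str]],
    duplicate_groups: List[List[Hashable]],
    id_field: str = "id",
) -> List[Dict[str, str]]:
    """Keep one document per duplicate group.

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    # An empty list of groups is a valid outcome: return the input unchanged.
    if not duplicate_groups:
        return list(docs)

    # Keep the first ID in each group and mark the remaining IDs for removal.
    ids_to_remove = set()
    for group in duplicate_groups:
        ids_to_remove.update(group[1:])

    return [doc for doc in docs if doc[id_field] not in ids_to_remove]


docs = [
    {"id": "a", "text": "Once upon a time"},
    {"id": "b", "text": "A completely different story"},
]

# No duplicates found: the dataset passes through untouched.
assert drop_duplicates(docs, []) == docs
# One duplicate group: only its first member survives.
assert drop_duplicates(docs, [["a", "b"]]) == [docs[0]]
```

The point of the sketch is only that "no duplicates" should be an ordinary code path rather than an error condition.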

Environment details

Using the NVIDIA pytorch:24.04-py3 image

glam621 commented 6 days ago

Assigning it to @ayushdg