Open Maghoumi opened 1 month ago
Describe the bug
When attempting to run fuzzy deduplication on a dataset that has no duplicates, the code errors out.
Steps/Code to reproduce bug
1) Clone the repo
2) Run the TinyStories tutorial
3) Run examples/fuzzy_deduplication.py on the dataset under tutorials/tinystories/data/jsonl/val/
Expected behavior
The script should complete without crashing when the dataset contains no duplicates.
Environment details
Using the NVIDIA pytorch:24.04-py3 image.
Assigning it to @ayushdg