NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

[BUG] Jaccard Shuffle error if merge result is empty #49

Open ayushdg opened 1 month ago

ayushdg commented 1 month ago

Describe the bug

If the merge result between the text and bucket-mapping dataframes is empty for any iteration, the logic fails.

The failure is observed here, but it originates from the result of https://github.com/NVIDIA/NeMo-Curator/blob/fe9fd6f46a932689ba036c623b2737298478c8ea/nemo_curator/utils/fuzzy_dedup_utils/shuffle_utils.py#L144 being empty. Still working on a minimal repro.

Additional context

The fix should be to continue with the next loop iteration if this is a zero-length dataframe.

The error looks like: ValueError: zero-size array to reduction operation maximum which has no identity
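
In lieu of a real repro, here is a rough sketch of the failure mode and the proposed guard. It is not the NeMo-Curator code: pandas and NumPy stand in for the GPU dataframes, and the column names (`bucket`, `output_partition_id`), the sample data, and the helper `max_partition_ids` are all hypothetical. It only shows that a max reduction over an empty merge result raises this ValueError, and that skipping zero-length results avoids it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the text and bucket-mapping dataframes.
text_df = pd.DataFrame({"bucket": [0, 1, 2], "text": ["a", "b", "c"]})
bucket_parts = [
    pd.DataFrame({"bucket": [0, 1], "output_partition_id": [3, 5]}),
    # No bucket in common with text_df, so the inner merge is empty.
    pd.DataFrame({"bucket": [7, 8], "output_partition_id": [1, 2]}),
    pd.DataFrame({"bucket": [2], "output_partition_id": [4]}),
]


def max_partition_ids(text_df, bucket_parts, skip_empty):
    """Merge each bucket part with text_df and take a max over the result."""
    results = []
    for part in bucket_parts:
        merged = text_df.merge(part, on="bucket", how="inner")
        if skip_empty and len(merged) == 0:
            # Proposed fix: nothing to shuffle for this iteration, move on.
            continue
        # np.max over a zero-size array raises ValueError.
        results.append(int(np.max(merged["output_partition_id"].to_numpy())))
    return results


# With the guard, the empty iteration is skipped.
print(max_partition_ids(text_df, bucket_parts, skip_empty=True))

# Without the guard, the second iteration fails.
try:
    max_partition_ids(text_df, bucket_parts, skip_empty=False)
except ValueError as err:
    print("ValueError:", err)
```

Checking the length right after the merge keeps the guard next to the place where the empty result originates, rather than where the reduction eventually fails.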