Open ayushdg opened 1 month ago
Describe the bug
If the merge result between the text df and the bucket mapping df is empty for any iteration, the logic fails.
The failure is observed here, but it originates from the df at https://github.com/NVIDIA/NeMo-Curator/blob/fe9fd6f46a932689ba036c623b2737298478c8ea/nemo_curator/utils/fuzzy_dedup_utils/shuffle_utils.py#L144 being empty. Still working on a minimal repro.
Additional context
The fix should be to continue with the loop (skip the iteration) if this is a 0-length df.
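A rough sketch of what that guard could look like, using plain pandas rather than the actual code in shuffle_utils.py. The function name `merge_nonempty_batches`, the column names `id`, `text`, and `bucket`, and the `max` reduction standing in for the downstream step are all hypothetical and only illustrate the idea:

```python
import pandas as pd


def merge_nonempty_batches(text_batches, bucket_mapping, on="id"):
    """Merge each text batch with the bucket mapping, skipping empty merges.

    Hypothetical helper: names and columns are assumptions, not NeMo-Curator API.
    """
    results = []
    for batch in text_batches:
        merged = batch.merge(bucket_mapping, on=on, how="inner")
        if len(merged) == 0:
            # Nothing in this batch maps to a bucket, so move on to the
            # next iteration instead of reducing over zero rows.
            continue
        # Stand-in for a downstream step that reduces over the merged rows.
        merged["max_bucket"] = merged["bucket"].to_numpy().max()
        results.append(merged)
    return results


bucket_mapping = pd.DataFrame({"id": [1, 2], "bucket": [10, 20]})
text_batches = [
    pd.DataFrame({"id": [1, 2], "text": ["a", "b"]}),
    pd.DataFrame({"id": [99], "text": ["no match"]}),  # merge result is empty
]
merged_batches = merge_nonempty_batches(text_batches, bucket_mapping)
```

Without the length check, the second batch would reach the reduction with zero rows, which is the kind of situation that produces the error below.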
Error here looks like:
ValueError: zero-size array to reduction operation maximum which has no identity
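For reference, NumPy raises this message when a `max` reduction runs over an empty array. The snippet below is not a repro of the NeMo-Curator failure itself, only a minimal illustration of where such a message can come from, assuming the failing step reduces over an array built from the empty df:

```python
import numpy as np

# An empty array, standing in for a column taken from a 0-length df.
empty = np.array([], dtype=np.int64)

try:
    np.max(empty)
    error_message = None
except ValueError as exc:
    # max has no identity element, so NumPy cannot return a value here.
    error_message = str(exc)

print(error_message)
```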