Open ayushdg opened 1 month ago
Describe the bug
By default, when reading from JSON/Parquet files without a specified index, Curator typically reads each partition with an index ranging from 0 to len(partition). Fuzzy dedup might fail for dataframes where this is not the case.
Steps/Code to reproduce bug
Reproducer in the #46 tests. The root cause seems to come from https://github.com/NVIDIA/NeMo-Curator/blob/fe9fd6f46a932689ba036c623b2737298478c8ea/nemo_curator/utils/fuzzy_dedup_utils/merge_utils.py#L161, where the lhs df might have different indices but the rhs starts from 0, resulting in a misaligned assignment.
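For illustration only, here is a minimal sketch of the general failure mode, assuming the problem is index-aligned column assignment. It uses plain pandas rather than cuDF, and the frame, column names, and values are hypothetical; it is not the code at the linked line in merge_utils.py.

```python
import pandas as pd

# Hypothetical left-hand frame whose index does not start at 0,
# e.g. a partition that kept a non-default index.
lhs = pd.DataFrame({"text": ["a", "b", "c"]}, index=[10, 11, 12])

# Hypothetical right-hand values carrying a default 0..n-1 index.
rhs = pd.Series([100, 200, 300])

# Column assignment aligns on index labels, not on row position,
# so none of the labels 10, 11, 12 find a match in rhs.
lhs["group"] = rhs
misaligned_nulls = int(lhs["group"].isna().sum())

# One possible workaround (an assumption, not the project's fix):
# reset the left index so both sides share 0..n-1 labels.
fixed = lhs.drop(columns="group").reset_index(drop=True)
fixed["group"] = rhs
fixed_nulls = int(fixed["group"].isna().sum())

print(misaligned_nulls, fixed_nulls)
```

In this sketch the first assignment leaves every row of `group` null, while the version with a reset index fills all rows. Whether the same alignment behavior is what triggers the failure in fuzzy dedup would need to be confirmed against the reproducer.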
Expected behavior
No errors