Open ruchaa-apte opened 1 month ago
This was caused by an empty partition. Dropping the empty partitions before running the pipeline works around it:
partition_lengths = ddf.map_partitions(len).compute()
non_empty_partitions = [i for i, length in enumerate(partition_lengths) if length > 0]
filtered_ddf = ddf.partitions[non_empty_partitions]
Long term, we should fix this in crossfit or NeMo Curator, or at least fail loudly.
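For the "fail loudly" option, here is a minimal sketch of a guard. It is not existing crossfit or NeMo Curator code, and the helper names `find_empty_partitions` and `assert_no_empty_partitions` are hypothetical. It assumes the per-partition lengths have already been computed the same way as in the workaround (`ddf.map_partitions(len).compute()`), and raises a descriptive error instead of letting a bare IndexError surface later.

```python
from typing import List, Sequence


def find_empty_partitions(partition_lengths: Sequence[int]) -> List[int]:
    """Return the indices of partitions that contain zero rows."""
    return [i for i, length in enumerate(partition_lengths) if length == 0]


def assert_no_empty_partitions(partition_lengths: Sequence[int]) -> None:
    """Raise a descriptive ValueError if any partition is empty."""
    empty = find_empty_partitions(partition_lengths)
    if empty:
        raise ValueError(
            f"{len(empty)} of {len(partition_lengths)} partitions are empty "
            f"(indices {empty}); drop or repartition them before semantic dedupe."
        )


# Hypothetical usage with the lengths from the workaround:
#   partition_lengths = ddf.map_partitions(len).compute()
#   assert_no_empty_partitions(list(partition_lengths))
```

Where exactly such a check would belong (crossfit or NeMo Curator) is still open; the sketch only shows the shape of the error that would have made this easier to diagnose.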
Describe the bug
While running Semantic Deduplication on text files, the semantic dedupe pipeline starts but then fails with:
IndexError: list index out of range
Error Log
Steps/Code to reproduce bug
Config for semantic dedupe
Environment overview
Environment details