NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0

semantic_dedupe runs into IndexError: list index out of range #341

Open ruchaa-apte opened 1 month ago

ruchaa-apte commented 1 month ago

Describe the bug

While running semantic deduplication on text files, the semantic dedupe pipeline starts but fails with IndexError: list index out of range.

Error log:

GPU: 0, Part: 20: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.57it/s]
2024-10-30 13:42:26,014 - distributed.utils_perf - WARNING - full garbage collections took 72% CPU time recently (threshold: 10%)
2024-10-30 13:42:26,060 - distributed.utils_perf - WARNING - full garbage collections took 65% CPU time recently (threshold: 10%)
2024-10-30 13:42:26,196 - distributed.worker - WARNING - Compute Failed
Key:       ('read_single_partition-fused-toparquetdata-f053b8f0935a4edb94f161972e2f27a8', 2)
State:     executing
Function:  execute_task
args:      ((<function Fused._execute_task at 0x7f2e10b56d40>, {'read_single_partition-fused-toparquetdata-f053b8f0935a4edb94f161972e2f27a8': ('toparquetdata-9d98c9d2f77890cf221ce0d97398b829', 2), ('toparquetdata-9d98c9d2f77890cf221ce0d97398b829', 2): (<dask.dataframe.io.parquet.core.ToParquetFunctionWrapper object at 0x7f2a9b3a6140>, ('reset_index-0ced2634f121de0dd6cc480baf637ca7', 2), (2,)), ('reset_index-0ced2634f121de0dd6cc480baf637ca7', 2): (<function apply at 0x7f2e663ec9d0>, <methodcaller: reset_index>, [('<crossfit.backend.torch.op.base.predictor object a-bd58448c0c3a7e2471c1d5ce629f4850', 2)], {'drop': True}), ('<crossfit.backend.torch.op.base.predictor object a-bd58448c0c3a7e2471c1d5ce629f4850', 2): (<function apply at 0x7f2e663ec9d0>, <function apply_and_enforce at 0x7f2e1139b910>, [('<crossfit.op.tokenize.tokenizer object at 0x7fdad0-ea35aded3a541ceda1ad391c99bb6e42', 2)], {'partition_info': {'number': 2, 'division': None}, '_func': <crossfit.backend.torch.op.base.Predictor object at 
kwargs:    {}
Exception: "IndexError('list index out of range')"
Traceback: '  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/_expr.py", line 3758, in _execute_task\n    return dask.core.get(graph, name)\n  File "/home/nemo_curator/lib/python3.10/site-packages/dask/core.py", line 157, in get\n    result = _execute_task(task, cache)\n  File "/home/nemo_curator/lib/python3.10/site-packages/dask/core.py", line 127, in _execute_task\n    return func(*(_execute_task(a, cache) for a in args))\n  File "/home/nemo_curator/lib/python3.10/site-packages/dask/utils.py", line 78, in apply\n    return func(*args, **kwargs)\n  File "/home/nemo_curator/lib/python3.10/site-packages/dask/dataframe/core.py", line 7164, in apply_and_enforce\n    df = func(*args, **kwargs)\n  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/base.py", line 96, in __call__\n    output = self.call(data, *args, **kwargs)\n  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 155, in call\n    input_ids, attention_mask = self.call_column(data[col])\n  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 120, in call_column\n    tokenized_data = self.tokenize_strings(text).copy()\n  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 71, in tokenize_strings\n    tokenized_data = tokenizer.batch_encode_plus(\n  File "/home/nemo_curator/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3306, in batch_encode_plus\n    return self._batch_encode_plus(\n  File "/home/nemo_curator/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 562, in _batch_encode_plus\n    for key in tokens_and_encodings[0][0].keys():\n'

2024-10-30 13:42:26,203 - distributed.utils_perf - WARNING - full garbage collections took 73% CPU time recently (threshold: 10%)
GPU: 0, Part: 19:   0%|                                                                                                                                                                                                        | 0/1 [00:00<?, ?it/s]2024-10-30 13:42:26,302 - distributed.utils_perf - WARNING - full garbage collections took 65% CPU time recently (threshold: 10%)
Traceback (most recent call last):
  File "/home/projects/chem-data-curation/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 283, in <module>
    main()
  File "/home/projects/chem-data-curation/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 265, in main
    run_curation_pipeline(args, text_files, code_files)
  File "/home/projects/chem-data-curation/NeMo-Curator/tutorials/dapt-curation/code/main.py", line 177, in run_curation_pipeline
    semantic_dataset_text = semantic_dedupe(dataset=gpu_dataset_text, sem_dedupe_config_yaml_path=sem_dedupe_config_yaml_path, type='text')
  File "/home/projects/chem-data-curation/NeMo-Curator/tutorials/dapt-curation/code/utils.py", line 354, in semantic_dedupe
    duplicates = semdup(dataset)
  File "/home/nemo_curator/lib/python3.10/site-packages/nemo_curator/modules/semantic_dedup.py", line 637, in __call__
    embeddings_dataset = self.embedding_creator(dataset)
  File "/home/nemo_curator/lib/python3.10/site-packages/nemo_curator/modules/semantic_dedup.py", line 215, in __call__
    write_to_disk(
  File "/home/nemo_curator/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 577, in write_to_disk
    df.to_parquet(output_file_dir, write_index=False)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/_collection.py", line 3281, in to_parquet
    return to_parquet(self, path, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/io/parquet.py", line 653, in to_parquet
    out = out.compute(**compute_kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/_collection.py", line 476, in compute
    return DaskMethodsMixin.compute(out, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask/base.py", line 376, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask/base.py", line 662, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/dask_expr/_expr.py", line 3758, in _execute_task
    return dask.core.get(graph, name)
  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/base.py", line 96, in __call__
    output = self.call(data, *args, **kwargs)
  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 155, in call
    input_ids, attention_mask = self.call_column(data[col])
  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 120, in call_column
    tokenized_data = self.tokenize_strings(text).copy()
  File "/home/nemo_curator/lib/python3.10/site-packages/crossfit/op/tokenize.py", line 71, in tokenize_strings
    tokenized_data = tokenizer.batch_encode_plus(
  File "/home/nemo_curator/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3306, in batch_encode_plus
    return self._batch_encode_plus(
  File "/home/nemo_curator/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 562, in _batch_encode_plus
    for key in tokens_and_encodings[0][0].keys():
IndexError: list index out of range
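
The final frames point at transformers' fast tokenizer: _batch_encode_plus indexes tokens_and_encodings[0][0], which raises exactly this IndexError when the batch of strings is empty. A minimal sketch reproducing the same failure outside of Dask (assuming transformers is installed; the model name is taken from the config below):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
# Passing an empty batch makes tokens_and_encodings empty inside
# _batch_encode_plus, so indexing [0][0] fails with the same error
# seen in the traceback above.
tokenizer.batch_encode_plus([])  # IndexError: list index out of range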

Steps/Code to reproduce bug

Config for semantic dedupe:

# Configuration file for semantic dedup
cache_dir: "workspace/text/semdedup_cache"
num_files: 16
id_col_name: "id"
id_col_type: "str"
input_column: "text"

# Embeddings configuration
embeddings_save_loc: "embeddings"
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embedding_max_mem_gb: 20

# Clustering configuration
clustering_save_loc: "clustering_results"
n_clusters: 20
seed: 1234
max_iter: 100
kmeans_with_cos_dist: false

# Semdedup configuration
which_to_keep: "hard"
largest_cluster_size_to_process: 100000
sim_metric: "cosine"

# Extract dedup configuration
eps_thresholds:
  - 0.01
  - 0.001

# Which threshold to use for extracting deduped data
eps_to_extract: 0.01
    cache_dir = f"./workspace/semantic_dedupe/{type}"
    if os.path.isdir(cache_dir):
        os.system(f"rm -rf {cache_dir}")

    semdedup_config = SemDedupConfig.from_yaml(sem_dedupe_config_yaml_path)
    expand_outdir_and_mkdir(semdedup_config.cache_dir)
    semdup = SemDedup(semdedup_config)
    duplicates = semdup(dataset)


VibhuJawa commented 3 weeks ago

This was due to an empty partition and was fixed by filtering out the empty partitions before tokenization:

# Compute each partition's length, then keep only the non-empty partitions.
partition_lengths = ddf.map_partitions(len).compute()
non_empty_partitions = [i for i, length in enumerate(partition_lengths) if length > 0]
filtered_ddf = ddf.partitions[non_empty_partitions]
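
Applied to the tutorial's pipeline, the filtered dataframe can be wrapped back into a DocumentDataset before deduplication (a sketch; DocumentDataset from nemo_curator.datasets is assumed, matching the tutorial code):

from nemo_curator.datasets import DocumentDataset

# filtered_ddf is the dataframe from the snippet above with empty
# partitions dropped; rebuild the dataset and rerun deduplication.
dataset = DocumentDataset(filtered_ddf)
duplicates = semdup(dataset)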

Long term, we should fix this in crossfit or NeMo Curator, or at least fail loudly; a hypothetical guard is sketched below.
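
A guard along those lines might look like this (the helper name and error message are illustrative, not existing NeMo Curator or crossfit API):

import dask.dataframe as dd


def assert_no_empty_partitions(ddf: dd.DataFrame) -> dd.DataFrame:
    # Hypothetical helper: fail loudly at the source instead of letting
    # the tokenizer raise an opaque IndexError deep inside transformers.
    partition_lengths = ddf.map_partitions(len).compute()
    empty = [i for i, n in enumerate(partition_lengths) if n == 0]
    if empty:
        raise ValueError(
            f"Partitions {empty} are empty; filter or repartition the "
            "dataset before computing embeddings."
        )
    return ddf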