EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

IndexError on BBH tasks #1688

Open RylanSchaeffer opened 3 months ago

RylanSchaeffer commented 3 months ago

Across a few models and a few BBH tasks, I obtain this error:

    match = [m for m in match if m][0]
IndexError: list index out of range

The full stack trace is below:

$ lm_eval --model hf \
    --model_args pretrained=01-ai/Yi-6B,trust_remote_code=True \
    --tasks bbh_cot_zeroshot_web_of_lies \
    --batch_size auto:4 \
    --device cuda:1 \
    --output_path eval_results/Yi_6B/bbh_cot_zeroshot_web_of_lies \
    --log_samples
2024-04-08:13:58:41,492 INFO     [__main__.py:225] Verbosity set to INFO
2024-04-08:13:58:41,492 INFO     [__init__.py:373] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-04-08:13:58:44,990 INFO     [__main__.py:311] Selected Tasks: ['bbh_cot_zeroshot_web_of_lies']
2024-04-08:13:58:44,990 INFO     [__main__.py:312] Loading selected tasks...
2024-04-08:13:58:44,992 INFO     [evaluator.py:129] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-04-08:13:58:46,670 WARNING  [logging.py:61] Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-04-08:13:58:46,670 INFO     [huggingface.py:162] Using device 'cuda:1'
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.09it/s]
2024-04-08:13:58:49,152 INFO     [evaluator.py:190] get_task_dict has been updated to accept an optional argument, `task_manager`. Read more here: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage
2024-04-08:13:58:50,931 WARNING  [task.py:322] [Task: bbh_cot_zeroshot_web_of_lies] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-04-08:13:58:50,931 WARNING  [task.py:322] [Task: bbh_cot_zeroshot_web_of_lies] has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.
2024-04-08:13:58:50,937 INFO     [task.py:395] Building contexts for bbh_cot_zeroshot_web_of_lies on rank 0...
100%|██████████| 250/250 [00:00<00:00, 2151.14it/s]
2024-04-08:13:58:51,057 INFO     [evaluator.py:357] Running generate_until requests
Running generate_until requests:   0%| | 0/250 [00:00<?, ?it/s]
Passed argument batch_size = auto. Detecting largest batch size
Determined Largest batch size: 16
Running generate_until requests: 100%|██████████| 250/250 [01:18<00:00,  3.19it/s]
Traceback (most recent call last):
  File "/lfs/skampere1/0/rschaef/miniconda3/envs/pred_llm_evals_env/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/lfs/skampere1/0/rschaef/KoyejoLab-Predictable-LLM-Evals/submodules/lm-evaluation-harness/lm_eval/__main__.py", line 318, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/lfs/skampere1/0/rschaef/KoyejoLab-Predictable-LLM-Evals/submodules/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/lfs/skampere1/0/rschaef/KoyejoLab-Predictable-LLM-Evals/submodules/lm-evaluation-harness/lm_eval/evaluator.py", line 230, in simple_evaluate
    results = evaluate(
  File "/lfs/skampere1/0/rschaef/KoyejoLab-Predictable-LLM-Evals/submodules/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/lfs/skampere1/0/rschaef/KoyejoLab-Predictable-LLM-Evals/submodules/lm-evaluation-harness/lm_eval/evaluator.py", line 383, in evaluate
    task.apply_filters()
  File "/lfs/skampere1/0/rschaef/KoyejoLab-Predictable-LLM-Evals/submodules/lm-evaluation-harness/lm_eval/api/task.py", line 971, in apply_filters
    f.apply(self._instances)
  File "/lfs/skampere1/0/rschaef/KoyejoLab-Predictable-LLM-Evals/submodules/lm-evaluation-harness/lm_eval/api/filter.py", line 51, in apply
    resps = f().apply(resps, docs)
  File "/lfs/skampere1/0/rschaef/KoyejoLab-Predictable-LLM-Evals/submodules/lm-evaluation-harness/lm_eval/filters/extraction.py", line 46, in apply
    filtered_resps = list(map(lambda x: filter_set(x), resps))
  File "/lfs/skampere1/0/rschaef/KoyejoLab-Predictable-LLM-Evals/submodules/lm-evaluation-harness/lm_eval/filters/extraction.py", line 46, in <lambda>
    filtered_resps = list(map(lambda x: filter_set(x), resps))
  File "/lfs/skampere1/0/rschaef/KoyejoLab-Predictable-LLM-Evals/submodules/lm-evaluation-harness/lm_eval/filters/extraction.py", line 38, in filter_set
    match = [m for m in match if m][0]
IndexError: list index out of range
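For context, the crash happens inside the harness's regex answer-extraction filter (lm_eval/filters/extraction.py). When the extraction regex matches a response but every captured group is empty (e.g. a chain-of-thought answer that never contains the expected answer phrase), the list comprehension produces an empty list and indexing `[0]` raises. A minimal standalone sketch of that failure mode (the regex and surrounding structure here are illustrative simplifications, not the harness's exact code):

    import re

    # Hypothetical multi-group extraction regex, illustrative only.
    regex = re.compile(r"answer is (Yes|No)?(.*)?")

    def filter_set(resps, fallback="[invalid]"):
        filtered = []
        for resp in resps:
            match = regex.findall(resp)
            if match:
                match = match[0]
                if isinstance(match, tuple):
                    # findall() returns tuples for multi-group regexes; if
                    # every group is empty, the comprehension is [] and [0]
                    # raises IndexError — the crash in the traceback above.
                    match = [m for m in match if m][0]
                filtered.append(match.strip())
            else:
                filtered.append(fallback)
        return filtered

    print(filter_set(["the answer is Yes"]))  # ['Yes']
    print(filter_set(["answer is "]))         # IndexError: list index out of range

Guarding that indexing with the same fallback used in the no-match branch would avoid the crash; responses with no usable match would then simply score the fallback value instead of aborting the run.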
yinsong1986 commented 3 months ago

Using v0.4.1 might fix the issue: check out the release tag with `git checkout tags/v0.4.1`. A minimal sketch of the workaround follows below.
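If you installed the harness from a local clone, a minimal sketch of pinning to that tag (assuming an editable install; adjust the clone path to yours):

    cd lm-evaluation-harness    # path to your local clone
    git checkout tags/v0.4.1
    pip install -e .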

yinsong1986 commented 3 months ago

This is because