EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License
6.9k stars 1.84k forks

`mmlu_flan` tasks fail #2460

Open eminorhan opened 7 hours ago

eminorhan commented 7 hours ago

I'm new to this library. I installed the library as instructed on the README page. Running:

lm_eval --model hf --model_args pretrained=gpt2 --tasks mmlu_flan_cot_zeroshot

results in the following error:

INFO:lm-eval:Verbosity set to INFO
INFO:lm-eval:The tag 'arc_ca' is already registered as a group, this tag will not be registered. This may affect tasks you want to call.
INFO:lm-eval:The tag 'arc_ca' is already registered as a group, this tag will not be registered. This may affect tasks you want to call.
INFO:lm-eval:Selected Tasks: ['mmlu_flan_cot_zeroshot']
INFO:lm-eval:Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
INFO:lm-eval:Initializing hf model, with arguments: {'pretrained': 'gpt2'}
INFO:lm-eval:Using device 'cuda'
INFO:lm-eval:Using model type 'default'
INFO:lm-eval:Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
Using the latest cached version of the module from /lustre/orion/stf218/scratch/emin/huggingface/modules/datasets_modules/datasets/hails--mmlu_no_train/b7d5f7f21003c21be079f11495ee011332b980bd1cd7e70cc740e8c079e5bda2 (last modified on Tue Nov  5 09:34:31 2024) since it couldn't be found locally at hails/mmlu_no_train
WARNING:datasets.load:Using the latest cached version of the module from /lustre/orion/stf218/scratch/emin/huggingface/modules/datasets_modules/datasets/hails--mmlu_no_train/b7d5f7f21003c21be079f11495ee011332b980bd1cd7e70cc740e8c079e5bda2 (last modified on Tue Nov  5 09:34:31 2024) since it couldn't be found locally at hails/mmlu_no_train
WARNING:lm-eval:filter `<class 'utils.MultiChoiceRegexFilter'>` is not registered!
Traceback (most recent call last):
  File "/lustre/orion/stf218/scratch/emin/miniconda3/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/lustre/orion/stf218/scratch/emin/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf218/scratch/emin/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf218/scratch/emin/lm-evaluation-harness/lm_eval/evaluator.py", line 235, in simple_evaluate
    task_dict = get_task_dict(tasks, task_manager)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf218/scratch/emin/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 618, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf218/scratch/emin/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 414, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/lustre/orion/stf218/scratch/emin/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 398, in _load_individual_task_or_group
    group_name: dict(collections.ChainMap(*map(fn, reversed(subtask_list))))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf218/scratch/emin/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 398, in _load_individual_task_or_group
    group_name: dict(collections.ChainMap(*map(fn, reversed(subtask_list))))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf218/scratch/emin/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 314, in _load_individual_task_or_group
    return _load_task(task_config, task=name_or_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf218/scratch/emin/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 280, in _load_task
    task_object = ConfigurableTask(config=config)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf218/scratch/emin/lm-evaluation-harness/lm_eval/api/task.py", line 833, in __init__
    filter_pipeline = build_filter_ensemble(filter_name, components)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf218/scratch/emin/lm-evaluation-harness/lm_eval/filters/__init__.py", line 21, in build_filter_ensemble
    f = partial(get_filter(function), **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: the first argument must be callable
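
For what it's worth, the TypeError at the bottom of the trace is the generic error functools.partial raises when it is handed a non-callable object, which suggests get_filter never resolved utils.MultiChoiceRegexFilter (the unregistered filter from the warning above) into an actual filter class. A minimal sketch of just that Python behavior, independent of the harness:

import functools

# functools.partial rejects anything that is not callable with exactly this message,
# so an unresolved filter (e.g. None or some other non-callable placeholder) reaching
# build_filter_ensemble would fail the same way.
functools.partial(None)  # TypeError: the first argument must be callable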

The same issue is mentioned in several other reports, e.g. https://github.com/EleutherAI/lm-evaluation-harness/issues/1885 and https://github.com/EleutherAI/lm-evaluation-harness/issues/2094, so it seems to be a long-standing problem.

I tried several previous releases, but got a different panoply of errors each time. Is there any way at all to evaluate models on the mmlu_flan tasks (a particular previous commit that demonstrably works, a relatively straightforward patch, etc.)? These are pretty fundamental tasks in LLM evals, and it seems unfortunate that the library persistently fails on this set of tasks.

baberabb commented 6 hours ago

Hi! This should be fixed by the PR.

eminorhan commented 6 hours ago

Thank you very much for the quick response. I can confirm that the submitted PR fixes the issue for me.

eminorhan commented 3 hours ago

Hmm, I'm getting awfully low scores though. Accuracy for llama-3.1-8b seems to be in the single digits, whereas Meta reports it to be 73.0. Are we sure that the config for this task is set up correctly? It's hard for me to verify this, since I'm very new to this library. Would you be able to check the performance of the model meta-llama/Llama-3.1-8B for instance?

baberabb commented 3 hours ago

> Hmm, I'm getting awfully low scores though. Accuracy for llama-3.1-8b seems to be in the single digits, whereas Meta reports it to be 73.0. Are we sure that the config for this task is set up correctly? It's hard for me to verify this, since I'm very new to this library. Would you be able to check the performance of the model meta-llama/Llama-3.1-8B for instance?

do you have a reference for this?

eminorhan commented 3 hours ago

Yes, this post by Meta (https://ai.meta.com/blog/meta-llama-3-1/) reports the performance of llama-3.1-8b on zero-shot CoT MMLU as 73.0. That is supposed to be the same eval as mmlu_flan_cot_zeroshot, right?
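
In case it helps with pinning down the discrepancy, here is a rough spot-check sketch using the harness's Python API (hedged: the exact simple_evaluate arguments and the layout of the returned results may differ across versions). The idea is to log a few samples and check whether the CoT answer-extraction filter is actually pulling a letter choice out of the model's generations:

import lm_eval

# Sketch only: run a handful of documents per subtask and keep per-sample logs.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B",
    tasks=["mmlu_flan_cot_zeroshot"],
    limit=10,          # only a few documents per subtask, for a quick look
    log_samples=True,  # keep generations so the filtered answers can be inspected
)

# Print the filtered (extracted) answer next to the target for a couple of samples per
# subtask; mostly-empty filtered responses would point at the filter rather than the model.
for task_name, samples in results.get("samples", {}).items():
    for s in samples[:2]:
        print(task_name, s.get("filtered_resps"), "| target:", s.get("target"))

If the extracted answers look reasonable, the gap is more likely prompt/format related (e.g. chat template or generation length) than a scoring bug.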