EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Having issues with MMLU benchmark #2094

Open berkatil opened 1 month ago

berkatil commented 1 month ago

Hi,

I am trying to run some LLMs (currently OpenAI models) on MMLU. My first question is: which configuration is the standard setup (5-shot, without CoT)? And what does "flan" mean in some of the configuration names?

Also, when I tried running `lm_eval --model openai-chat-completions --model_args model=gpt-3.5-turbo --num_fewshot 5 --tasks mmlu_college_mathematics`, I got this error:

```
    sys.exit(cli_evaluate())
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/evaluator.py", line 296, in simple_evaluate
    results = evaluate(
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/evaluator.py", line 468, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 475, in loglikelihood
    raise NotImplementedError("No support for logits.")
NotImplementedError: No support for logits.
```

When I tried running `lm_eval --model openai-chat-completions --model_args model=gpt-3.5-turbo --num_fewshot 3 --tasks mmlu_flan_n_shot_generative_college_mathematics`, I got this error:

```
  File "/Users/bka5352/miniconda3/envs/test/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/evaluator.py", line 227, in simple_evaluate
    task_dict = get_task_dict(tasks, task_manager)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 616, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 410, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 310, in _load_individual_task_or_group
    return _load_task(task_config, task=name_or_config)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 276, in _load_task
    task_object = ConfigurableTask(config=config)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/api/task.py", line 837, in __init__
    filter_pipeline = build_filter_ensemble(filter_name, components)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/filters/__init__.py", line 21, in build_filter_ensemble
    f = partial(get_filter(function), **kwargs)
TypeError: the first argument must be callable
```

Your help is appreciated!

shizhediao commented 1 month ago

Same question! It seems that support for `mmlu_flan_cot_fewshot` is incomplete.

haileyschoelkopf commented 1 month ago

@lintangsutawika did recent changes break the Flan MMLU variants?

Regarding the "No support for logits" error, this is expected.

We evaluate all "multiple_choice"-type tasks using loglikelihoods, comparing the probabilities of each answer in the fixed answer set and selecting the most likely one as the model's answer. Because many API models don't provide logits / logprobs / loglikelihoods as output, we can't evaluate them with this type of scoring approach.

Whether loglikelihood-based evals are "right" is pretty subjective and generally debated, but the approach has a number of advantages, especially when evaluating base LMs: it's closer to language modeling as a task, and there's no chance of parsing failures when scoring outputs. It's also a lot faster and cheaper than generating text for each document.

See Appendix A in this paper for more background on methodology! https://arxiv.org/abs/2405.14782
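
For intuition, here's a minimal sketch (not the harness's actual code) of what loglikelihood-based multiple-choice scoring boils down to. `score_loglikelihood` is a hypothetical stand-in for a model call that returns the summed log-probability of a continuation given the context:

```python
# Minimal sketch (not the harness's internals) of loglikelihood-based
# multiple-choice scoring: score every fixed answer choice as a continuation
# of the prompt and pick the one the model considers most likely.
from typing import Callable, List


def pick_answer(
    context: str,
    choices: List[str],
    score_loglikelihood: Callable[[str, str], float],  # hypothetical model call
) -> int:
    """Return the index of the choice with the highest log p(choice | context)."""
    scores = [score_loglikelihood(context, " " + choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])


if __name__ == "__main__":
    # Toy scorer purely for illustration: it just prefers shorter continuations.
    toy_scorer = lambda ctx, cont: -float(len(cont))
    question = "Question: What is 2 + 2?\nAnswer:"
    options = ["3", "4", "5", "22"]
    print(pick_answer(question, options, toy_scorer))  # prints 0 with this toy scorer
```

Raw summed loglikelihoods tend to favour shorter answers, which is why a length-normalized accuracy (acc_norm) is also commonly reported alongside plain accuracy for multiple-choice tasks.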

shizhediao commented 1 month ago

I see, thank you! Yes, I also encountered the second issue: `TypeError: the first argument must be callable`.

lintangsutawika commented 1 month ago

Not that I'm aware of, aside from the group configuration. It looks like this is an issue related to a recent change in filters?

berkatil commented 1 month ago

Thanks for your quick responses! Could you also explain what "flan" refers to here?

skramer-dev commented 1 month ago

I am also getting the 'first argument must be callable' error with just a basic MMLU task setup, and in this case it's a local model where the harness has access to logits: `--model hf --model_args '"pretrained=<model_path>"' --num_fewshot 5 --tasks '"mmlu*"' --batch_size 10`

This is the warning that precedes the crash:

```
2024-07-18:14:58:56,689 WARNING [registry.py:192] filter `<class 'utils.MultiChoiceRegexFilter'>` is not registered!
```

lintangsutawika commented 1 month ago

@berkatil "Flan" refers to the prompt format that was used to evaluate the Flan-T5 models.

@skramer-dev Will check on the missing filter. But if you want to evaluate on the default MMLU setup, use `mmlu`: `mmlu*` will call everything that has "mmlu" in its name, which includes quite a few variants.

skramer-dev commented 1 month ago

> @skramer-dev Will check on the missing filter. But if you want to evaluate on the default MMLU setup, use `mmlu`: `mmlu*` will call everything that has "mmlu" in its name, which includes quite a few variants.

The first thing I tried was using just `mmlu`, but that somehow resolved to tinymmlu instead of the regular MMLU, hence the attempt at `mmlu*`. I am seeing some more weird behaviour, however, so I will chalk this up to a code-version caching issue with my cloud environment rather than the harness.

tianyi-chen commented 3 weeks ago

I ran into the same `TypeError: the first argument must be callable` with mmlu_flan_n_shot_generative_stem. It seems that support for using a utils function as a filter went away with the introduction of the filter registry: `lm_eval.api.registry.get_filter()` always looks the filter name up in the registry, whereas for the MMLU flan generative tasks the "filter name" is a function: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/flan_n_shot/generative/_mmlu_flan_generative_template_yaml#L15
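
To make the failure mode concrete, here's a tiny mock of the lookup (not the harness's real registry code; the fallback behaviour is an assumption) showing how a filter "name" that never gets resolved to a callable ends up in `functools.partial` and raises exactly this TypeError:

```python
# Mock illustration only -- not lm-evaluation-harness code. It assumes the
# registry lookup hands back the raw value when a name is not registered.
from functools import partial

FILTER_REGISTRY = {"take_first": lambda resps: resps[:1]}


def get_filter(name_or_fn):
    # Hypothetical fallback: unknown names pass through unresolved instead of
    # being turned into a callable (e.g. a `function: ...` entry from a task YAML).
    return FILTER_REGISTRY.get(name_or_fn, name_or_fn)


partial(get_filter("take_first"))  # fine: resolves to a registered callable

try:
    partial(get_filter("utils.MultiChoiceRegexFilter"))  # unregistered -> still a str
except TypeError as err:
    print(err)  # -> the first argument must be callable
```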

cc @haileyschoelkopf @lintangsutawika