Open berkatil opened 1 month ago
Same question! It seems that support for `mmlu_flan_cot_fewshot` is not complete.
@lintangsutawika did recent changes break the Flan MMLU variants?
Regarding the "No support for logits" error, this is expected. We evaluate all "multiple_choice"-type tasks using loglikelihoods: we compare the probabilities of each answer in the fixed set and select the most likely one as the model's answer. Because many API models don't provide logits / logprobs / loglikelihoods as output, we can't evaluate them with this scoring approach.
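As a toy illustration of this scoring approach (this is not the harness's actual code, and the scores below are made-up constants standing in for what a real model would return):

```python
# Toy sketch of loglikelihood-based multiple-choice scoring, NOT the
# harness's actual implementation. A real model would return, for each
# answer choice, the summed log-probability of the choice's tokens given
# the question as context; here those scores are hypothetical constants.

def pick_answer(loglikelihoods):
    """Return the answer choice with the highest log-probability."""
    return max(loglikelihoods, key=loglikelihoods.get)

# Hypothetical per-choice scores for one MMLU question.
scores = {"A": -4.2, "B": -1.3, "C": -3.8, "D": -5.1}
print(pick_answer(scores))  # -> B
```

The point is simply that no text is generated or parsed: the model only scores the fixed answer strings, which is why an API that exposes no logprobs cannot be used this way.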
Whether loglikelihood-based evals are "right" is fairly subjective and generally debated, but they have a number of advantages, especially when evaluating base LMs: the approach is closer to language modeling as a task, there is no chance of parsing failures when scoring outputs, and it is much faster and cheaper than generating text for each document.
See Appendix A in this paper for more background on methodology! https://arxiv.org/abs/2405.14782
I see. Thank you!
Yes, I encountered the second issue: `TypeError: the first argument must be callable`.
Not that I'm aware of, aside from group configuration. Looks like this is an issue related to a recent change in filters?
Thanks for your quick responses! Could you also explain what "flan" refers to here?
I am also getting the "first argument must be callable" error with just a basic MMLU task setup, and in this case it's a local model where the harness has access to logits:

```
--model hf --model_args '"pretrained=<model_path>"' --num_fewshot 5 --tasks '"mmlu*"' --batch_size 10
```
This is the warning that precedes the crash:

```
2024-07-18:14:58:56,689 WARNING [registry.py:192] filter `<class 'utils.MultiChoiceRegexFilter'>` is not registered!
```
@berkatil Flan refers to the prompt format that was used to evaluate the Flan-T5 models.
@skramer-dev Will check on the missing filter. But if you want to evaluate on the default MMLU, use `mmlu`, because `mmlu*` will call everything that has `mmlu` in its name, which includes quite a few variants.
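To illustrate the glob behavior (the task list here is a hypothetical subset of registered task names, not the harness's real registry):

```python
import fnmatch

# Hypothetical subset of registered task names. The pattern "mmlu*"
# matches every name beginning with "mmlu", so it pulls in the Flan
# variants as well as the default group.
tasks = ["mmlu", "mmlu_flan_cot_fewshot", "mmlu_college_mathematics", "tinymmlu"]
print(fnmatch.filter(tasks, "mmlu*"))
# -> ['mmlu', 'mmlu_flan_cot_fewshot', 'mmlu_college_mathematics']
```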
The first thing I tried was just "mmlu", but that somehow resolved to tinymmlu instead of the regular mmlu, hence the attempt with "mmlu*". I am seeing some other weird behaviour as well, though, so I'll chalk this up to a code-version caching issue in my cloud env rather than the harness.
I ran into the same `TypeError: the first argument must be callable` with `mmlu_flan_n_shot_generative_stem`. It seems that support for a filter given as a utils function is gone after the introduction of the filter registry: `lm_eval.api.registry.get_filter()` always looks the filter name up in the registry, whereas the "filter name" is a function for the mmlu flan generative tasks: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/flan_n_shot/generative/_mmlu_flan_generative_template_yaml#L15
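A minimal sketch of the failure mode (simplified stand-ins, not the harness's actual code): when the registry lookup misses because the "filter name" is actually a function, `functools.partial` ends up being handed a non-callable and raises exactly this error.

```python
from functools import partial

FILTER_REGISTRY = {}  # simplified stand-in for the harness's filter registry


def get_filter(name):
    # Registry-only lookup: returns None when `name` (here, a function
    # referenced from the task YAML) was never registered under a key.
    return FILTER_REGISTRY.get(name)


def multi_choice_filter(resps):  # hypothetical filter function from a YAML config
    return resps


try:
    f = partial(get_filter(multi_choice_filter))  # lookup misses -> partial(None)
except TypeError as e:
    print(e)  # -> the first argument must be callable
```

If that reading is right, the fix is presumably for the lookup to return the argument unchanged when it is already callable, instead of always consulting the registry.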
cc @haileyschoelkopf @lintangsutawika
Hi,
I am trying to run some LLMs (currently OpenAI models) on MMLU. My first question: which configuration is the standard setup (5-shot, without CoT)? And what does "flan" mean in some of the configuration names?
Also, when I tried running

```
lm_eval --model openai-chat-completions --model_args model=gpt-3.5-turbo --num_fewshot 5 --tasks mmlu_college_mathematics
```

I got this error:

```
    sys.exit(cli_evaluate())
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/evaluator.py", line 296, in simple_evaluate
    results = evaluate(
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/evaluator.py", line 468, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 475, in loglikelihood
    raise NotImplementedError("No support for logits.")
NotImplementedError: No support for logits.
```

When I tried running
```
lm_eval --model openai-chat-completions --model_args model=gpt-3.5-turbo --num_fewshot 3 --tasks mmlu_flan_n_shot_generative_college_mathematics
```

I got this error:

```
  File "/Users/bka5352/miniconda3/envs/test/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
    return fn(*args, **kwargs)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/evaluator.py", line 227, in simple_evaluate
    task_dict = get_task_dict(tasks, task_manager)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 616, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 410, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 310, in _load_individual_task_or_group
    return _load_task(task_config, task=name_or_config)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 276, in _load_task
    task_object = ConfigurableTask(config=config)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/api/task.py", line 837, in __init__
    filter_pipeline = build_filter_ensemble(filter_name, components)
  File "/Users/bka5352/lm-evaluation-harness/lm_eval/filters/__init__.py", line 21, in build_filter_ensemble
    f = partial(get_filter(function), **kwargs)
TypeError: the first argument must be callable
```
Your help is appreciated!