berkatil opened this issue 3 months ago
I need to improve the heuristic answer extraction / filtration code for the fewshot BBH variants. The zeroshot variants have better answer extraction, which matters when the model doesn't stick to the expected "default" output format.
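For context, the kind of heuristic I mean is a regex-first extraction with a fallback. Here is a rough standalone sketch, not the harness's actual filter code, just an illustration: look for an explicit "the answer is ..." marker first, and fall back to the last non-empty line otherwise.

import re

def extract_answer(completion: str) -> str:
    """Heuristic answer extraction: prefer an explicit marker, else fall back."""
    # Case-insensitive search for "the answer is ...", stopping at a period,
    # closing parenthesis, or newline, e.g. "So the answer is (B)." -> "B".
    match = re.search(r"(?i)the answer is\s*\(?\s*([^\n\.\)]+)", completion)
    if match:
        return match.group(1).strip()
    # Fallback: last non-empty line with trailing punctuation stripped, so
    # free-form outputs still produce something comparable to the target.
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    return lines[-1].rstrip(".") if lines else ""

# Example: a completion that ignores the expected format still yields "B".
print(extract_answer("Let's think step by step... So the answer is (B)."))  # B
print(extract_answer("The shape has five sides.\nPentagon"))                # Pentagon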
Got it, thank you!
@haileyschoelkopf
I'm using this command:
lm_eval --model local-chat-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=https://api.deepinfra.com/v1/openai --tasks bbh_cot_zeroshot --verbosity INFO --log_samples --output_path "logs" --limit=50
and it throws an error:
Traceback (most recent call last):
File "/Users/raphaelkalandadze/anaconda3/envs/tinystory/bin/lm_eval", line 8, in <module>
sys.exit(cli_evaluate())
^^^^^^^^^^^^^^
File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/__main__.py", line 375, in cli_evaluate
results = evaluator.simple_evaluate(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/utils.py", line 395, in _wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/evaluator.py", line 221, in simple_evaluate
task_dict = get_task_dict(tasks, task_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 444, in get_task_dict
task_name_from_string_dict = task_manager.load_task_or_group(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 287, in load_task_or_group
collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 270, in _load_individual_task_or_group
**dict(collections.ChainMap(*map(fn, subtask_list))),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 178, in _load_individual_task_or_group
return load_task(task_config, task=name_or_config, group=parent_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 167, in load_task
task_object = ConfigurableTask(config=config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/api/task.py", line 825, in __init__
filter_pipeline = build_filter_ensemble(filter_name, components)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/filters/__init__.py", line 21, in build_filter_ensemble
f = partial(get_filter(function), **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: the first argument must be callable
Am I missing something? I tried bbh_cot_fewshot as well; it runs, but the exact match (em) score is almost 0.
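From the traceback, it looks like get_filter(function) is returning something that is not callable (my guess is that the filter name from the task YAML is not resolving in the registry). The message itself comes straight from functools.partial, which can be reproduced outside the harness, mirroring the partial(get_filter(function), **kwargs) call:

from functools import partial

# partial() requires its first argument to be callable; passing a non-callable
# value (as a failed filter lookup might return) raises the same TypeError.
try:
    partial(None, regex_pattern=r"answer is (.*)")  # regex_pattern is only illustrative
except TypeError as exc:
    print(exc)  # the first argument must be callable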
Hi, I am getting a similar error as well. I am using version 0.4.3. This is the command I am running:
lm_eval --model vllm --model_args pretrained=sarvamai/meta-llama/Meta-Llama-3.1-8B-Instruct --tasks bbh_zeroshot_word_sorting --num_fewshot 0 --device cuda --batch_size 4 --seed 1 --output_path output/bbh_llama --log_samples
Please find the stack trace below:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/u/rudra/lm-eval-harness/scripts/write_out.py", line 97, in <module>
main()
File "/u/rudra/lm-eval-harness/scripts/write_out.py", line 51, in main
task_dict = tasks.get_task_dict(task_names, task_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/u/rudra/lm-eval-harness/lm_eval/tasks/__init__.py", line 444, in get_task_dict
task_name_from_string_dict = task_manager.load_task_or_group(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/u/rudra/lm-eval-harness/lm_eval/tasks/__init__.py", line 287, in load_task_or_group
collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
File "/u/rudra/lm-eval-harness/lm_eval/tasks/__init__.py", line 178, in _load_individual_task_or_group
return load_task(task_config, task=name_or_config, group=parent_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/u/rudra/lm-eval-harness/lm_eval/tasks/__init__.py", line 167, in load_task
task_object = ConfigurableTask(config=config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/u/rudra/lm-eval-harness/lm_eval/api/task.py", line 825, in __init__
filter_pipeline = build_filter_ensemble(filter_name, components)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/u/rudra/lm-eval-harness/lm_eval/filters/__init__.py", line 21, in build_filter_ensemble
f = partial(get_filter(function), **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: the first argument must be callable
Hi,
I am trying to run OpenAI models on some of the BBH tasks. I thought cot_fewshot is supposed to be the standard setup with num_fewshot 3. However, when I run
lm_eval --model openai-chat-completions --model_args model=gpt-4o --num_fewshot 3 --tasks bbh_cot_fewshot_geometric_shapes
I get 0 as the exact match score. What would be the correct command?
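One way to check whether this is the extraction problem from the top of this thread (rather than the model genuinely failing) is to compare the filtered responses against the targets in the --log_samples output. A rough sketch; the directory layout and field names (samples_*.jsonl, target, filtered_resps) are assumptions based on recent 0.4.x output, so adjust to whatever your run actually wrote:

import glob
import json

# Assumes --output_path "logs" and --log_samples; the harness writes one
# samples_<task>_<timestamp>.jsonl file per task under a model-named subdirectory.
for path in glob.glob("logs/**/samples_bbh_cot_fewshot_geometric_shapes*.jsonl", recursive=True):
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            target = rec.get("target")
            filtered = rec.get("filtered_resps", [])
            extracted = filtered[0] if filtered else None
            # Mismatches here point at the answer-extraction heuristic,
            # not at the model's underlying reasoning.
            if extracted != target:
                print(f"target={target!r}  extracted={extracted!r}")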