EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Issues with BBH benchmark #2095

Open berkatil opened 1 month ago

berkatil commented 1 month ago

Hi,

I am trying to run OpenAI models on some of the tasks in BBH. I thought cot_fewshot was supposed to be the standard setup with num_fewshot 3. However, when I run lm_eval --model openai-chat-completions --model_args model=gpt-4o --num_fewshot 3 --tasks bbh_cot_fewshot_geometric_shapes, I get 0 as the exact-match score. What would the correct command be?

haileyschoelkopf commented 1 month ago

I need to improve the heuristic answer extraction / filtering code for the few-shot BBH variants. The zero-shot variants have better answer extraction, which matters when your model doesn't stick to the expected "default" output format.
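
To illustrate what such heuristic extraction involves (a hedged sketch, not the harness's actual filter code; `ANSWER_RE` and `extract_answer` are hypothetical names): a chain-of-thought generation usually ends with a line like "So the answer is (B).", which a regex can pull out, falling back to the raw text when no match is found:

```python
import re

# Hypothetical extractor for CoT outputs; the harness's real filters differ.
ANSWER_RE = re.compile(r"answer is\s*\(?([A-Z])\)?", re.IGNORECASE)

def extract_answer(generation: str) -> str:
    """Pull a letter choice out of a chain-of-thought generation."""
    match = ANSWER_RE.search(generation)
    if match:
        return f"({match.group(1).upper()})"
    return generation.strip()  # fall back: score the raw text as-is

# e.g. extract_answer("Let's think step by step. So the answer is (B).")
# returns "(B)", while unparseable output is returned unchanged.
```

If a model answers in any other phrasing ("It must be B.", a bare letter, etc.), the fallback path compares the full generation against the gold label, which is why exact match collapses toward 0 for chatty models.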

berkatil commented 1 month ago

Got it, thank you!

RRaphaell commented 3 weeks ago

@haileyschoelkopf

I'm using this command:

lm_eval --model local-chat-completions --model_args model=meta-llama/Meta-Llama-3.1-8B-Instruct,base_url=https://api.deepinfra.com/v1/openai --tasks bbh_cot_zeroshot --verbosity INFO --log_samples --output_path "logs" --limit=50

and it throws an error:

Traceback (most recent call last):
  File "/Users/raphaelkalandadze/anaconda3/envs/tinystory/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/__main__.py", line 375, in cli_evaluate
    results = evaluator.simple_evaluate(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/utils.py", line 395, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/evaluator.py", line 221, in simple_evaluate
    task_dict = get_task_dict(tasks, task_manager)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 444, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 287, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 270, in _load_individual_task_or_group
    **dict(collections.ChainMap(*map(fn, subtask_list))),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 178, in _load_individual_task_or_group
    return load_task(task_config, task=name_or_config, group=parent_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/tasks/__init__.py", line 167, in load_task
    task_object = ConfigurableTask(config=config)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/api/task.py", line 825, in __init__
    filter_pipeline = build_filter_ensemble(filter_name, components)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/raphaelkalandadze/Desktop/repos/lm-evaluation-harness/lm_eval/filters/__init__.py", line 21, in build_filter_ensemble
    f = partial(get_filter(function), **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: the first argument must be callable

Am I missing something? I tried bbh_cot_fewshot as well; it runs, but the exact-match score is almost 0.
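
For context, this `TypeError` comes from `functools.partial`, which rejects a non-callable first argument. If the filter lookup misses its registry and returns a non-callable (e.g. `None`), `partial` fails exactly like this. A minimal reproduction with a hypothetical registry (not the harness's actual code):

```python
from functools import partial

# Hypothetical minimal registry, mimicking a lookup that misses.
FILTER_REGISTRY = {"take_first": lambda resps: resps[0]}

def get_filter(name):
    return FILTER_REGISTRY.get(name)  # returns None on a miss

try:
    f = partial(get_filter("regex"))  # "regex" is not in the registry
except TypeError as e:
    print(e)  # -> the first argument must be callable
```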

murthyrudra commented 2 weeks ago

Hi, I am getting a similar error as well. I am using version 0.4.3. This is the command I am running:

lm_eval --model vllm --model_args pretrained=sarvamai/meta-llama/Meta-Llama-3.1-8B-Instruct --tasks bbh_zeroshot_word_sorting --num_fewshot 0 --device cuda --batch_size 4 --seed 1 --output_path output/bbh_llama --log_samples

Please find the stacktrace

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/u/rudra/lm-eval-harness/scripts/write_out.py", line 97, in <module>
    main()
  File "/u/rudra/lm-eval-harness/scripts/write_out.py", line 51, in main
    task_dict = tasks.get_task_dict(task_names, task_manager)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/rudra/lm-eval-harness/lm_eval/tasks/__init__.py", line 444, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/rudra/lm-eval-harness/lm_eval/tasks/__init__.py", line 287, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/u/rudra/lm-eval-harness/lm_eval/tasks/__init__.py", line 178, in _load_individual_task_or_group
    return load_task(task_config, task=name_or_config, group=parent_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/rudra/lm-eval-harness/lm_eval/tasks/__init__.py", line 167, in load_task
    task_object = ConfigurableTask(config=config)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/rudra/lm-eval-harness/lm_eval/api/task.py", line 825, in __init__
    filter_pipeline = build_filter_ensemble(filter_name, components)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/rudra/lm-eval-harness/lm_eval/filters/__init__.py", line 21, in build_filter_ensemble
    f = partial(get_filter(function), **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: the first argument must be callable

murthyrudra commented 2 weeks ago

On looking into the code, I see the issue is with this function here. It expects a string as input and searches the corresponding FILTER_REGISTRY for a match.

However, for BBH tasks we are passing custom classes as functions, as can be seen here.
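
One possible fix is to make the lookup tolerate callables: if the argument is already a callable (such as a custom filter class from a BBH task config), return it unchanged, and only consult the registry for string names. A sketch under that assumption (the registry contents and error handling here are illustrative, not the maintainers' actual patch):

```python
# Hypothetical minimal registry; the real FILTER_REGISTRY maps names
# like "take_first" or "regex" to filter classes.
FILTER_REGISTRY = {"take_first": lambda resps: resps[0]}

def get_filter(filter_name):
    """Resolve a filter spec (registered name or callable) to a callable."""
    if callable(filter_name):
        # Custom classes/functions from a task config are used directly.
        return filter_name
    try:
        return FILTER_REGISTRY[filter_name]
    except KeyError:
        raise ValueError(f"Unknown filter: {filter_name!r}") from None
```

With this shape, `partial(get_filter(function), **kwargs)` in build_filter_ensemble receives a callable in both cases, and unknown string names fail with a clear error instead of the opaque `TypeError`.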