EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

About the number of BBH tasks #946

Closed sglucas closed 11 months ago

sglucas commented 1 year ago

Hi

I find that the official BBH contains 23 tasks, but I can only find 20 tasks in this repo. Do you plan to add more tasks here?

StellaAthena commented 1 year ago

Which three are we missing?

Jiaqi0109 commented 11 months ago

Which three are we missing?

Hi, I noticed a discrepancy in the file names and found that 6 tasks are missing: boolean_expressions, multistep_arithmetic_two, object_counting, penguins_in_a_table, web_of_lies, word_sorting. See https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/bbh
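For anyone who wants to reproduce this check, here is a rough sketch that compares the JSON task files in the official BBH repo against the task configs in a local harness checkout. The local directory path and the assumption that config filenames line up one-to-one with the BBH JSON names are both guesses; adjust them to your branch.

```python
# Sketch: find official BBH tasks that have no matching config locally.
import json
import pathlib
import urllib.request

# List the official BBH task JSON files via the GitHub contents API.
api_url = "https://api.github.com/repos/suzgunmirac/BIG-Bench-Hard/contents/bbh"
with urllib.request.urlopen(api_url) as resp:
    entries = json.load(resp)
official = {e["name"].removesuffix(".json")
            for e in entries if e["name"].endswith(".json")}

# Task configs in a local harness checkout (path and naming are assumptions;
# group-level YAMLs or prefixed filenames may need extra filtering).
local_dir = pathlib.Path("lm_eval/tasks/bbh/flan_cot_fewshot")
local = {p.stem for p in local_dir.glob("*.yaml")}

print("official but missing locally:", sorted(official - local))
```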

memray commented 11 months ago

@StellaAthena It also seems that the number of examples doesn't match. Most datasets listed in the original BBH repo (BIG-Bench-Hard) have 250 data points, but the current dyck_languages, for example, has 1000 examples. One possible reason is that the current data comes from BIG-bench rather than from BBH.
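As a quick sanity check on the 250-example count, one could count the records in the official BBH JSON directly. The "examples" key is how that repo appears to structure its files; treat this as a sketch rather than a guaranteed schema.

```python
# Count examples in the official BBH dyck_languages file.
import json
import urllib.request

url = ("https://raw.githubusercontent.com/suzgunmirac/"
       "BIG-Bench-Hard/main/bbh/dyck_languages.json")
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# BBH tasks are expected to have 250 examples each.
print(len(data.get("examples", [])))
```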

haileyschoelkopf commented 11 months ago

Hi! In the big-refactor branch (soon to be the next major version release) we support BBH as implemented in the BBH paper, including 3-shot CoT over the 250 subselected examples with prompts matching theirs: https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/lm_eval/tasks/bbh/flan_cot_fewshot
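A minimal sketch of running one of these tasks from Python on that branch, assuming it exposes lm_eval.simple_evaluate; the task name below is a guess based on the linked directory, and the model checkpoint is only a placeholder, so check your checkout's task registry for the exact names.

```python
# Sketch: evaluate a hypothetical BBH CoT few-shot task on the big-refactor branch.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-1.4b",  # placeholder model
    tasks=["bbh_flan_cot_fewshot_boolean_expressions"],  # hypothetical task name
    batch_size=8,
)
print(results["results"])
```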