EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

mmlu_pro regex in template does not work #2237

Closed lxning closed 2 months ago

lxning commented 2 months ago

I got a score of 0 when I ran the following command. However, the response in the sample log does contain "answer is (A)".

Command:

lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct --tasks mmlu_pro_biology --batch_size auto --gen_kwargs max_gen_toks=512,do_sample=true,temperature=0.6,max_length=2048 --limit 5 --num_fewshot 5 --output_path ~/output --write_out --log_samples --apply_chat_template

Result:

| Tasks |Version|    Filter    |n-shot|  Metric   |   |Value|   |Stderr|
|-------|------:|--------------|-----:|-----------|---|----:|---|-----:|
|biology|      0|custom-extract|     5|exact_match|↑  |    0|±  |     0|

Sample log:

   "resps": [
    [
      "To determine which of the following would most likely provide examples of mitotic cell divisions, let's think about the types of cell divisions that occur in different tissues.\n\n* Mitotic cell divisions occur in tissues that are growing, repairing, or replacing cells.\n* Muscle tissue (A) is composed of long, multinucleated cells that are not typically undergoing mitotic cell divisions.\n* Shoot tips (B) are areas of rapid growth and cell division, and are a good example of mitotic cell divisions.\n* Leaf veins (C) are composed of vascular tissue, which is responsible for transporting nutrients and water throughout the plant. While some cell divisions may occur in this tissue, it is not typically characterized by rapid growth.\n* Fruits (D) are composed of mature cells that have stopped dividing.\n* Leaves (E) are composed of mature cells that have stopped dividing.\n* Petals (F) are composed of mature cells that have stopped dividing.\n* Seeds (G) are composed of mature cells that have stopped dividing.\n* Anthers (H) are the male reproductive structures of a flower, and are responsible for producing pollen. Like shoot tips, they are areas of rapid growth and cell division.\n\nBased on this information, the most likely examples of mitotic cell divisions would be found in a longitudinal section of a shoot tip (B), which is an area of rapid growth and cell division.\n\nThe answer is (B)."
    ]
  ],
  "filtered_resps": [
    "[invalid]"
  ],
  "doc_hash": "e54e359f68dd1d100107c59662422df45a415bdebdfd5f1d0a75b69298f6cc5e",
  "prompt_hash": "07fbb050a310c4c4e7aded748e0fb79d9a48f7acb6d70a39d8187522bd0e5121",
  "target_hash": "df7e70e5021544f4834bbee64a9e3789febc4be81470df629cad6ddb03320a5c",
  "exact_match": 0
}
liewziqin commented 2 months ago

I think this is because the fewshot context is not constructed properly. I have included the details in #2196.

lintangsutawika commented 2 months ago

@liewziqin you mentioned a fix in https://github.com/EleutherAI/lm-evaluation-harness/pull/2238, but the issue still persists because of the fewshot context, is that right?

liewziqin commented 2 months ago

@lintangsutawika Yes, I think the issue is not the regex pattern. Currently, the fewshot samples in our prompt do not include the CoT context or the "answer is" phrase, both of which are essential for the regex to work. Sample of the current prompt:

The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.Question:\nAs of 2017, how many of the world’s 1-year-old children today have been vaccinated against some disease? *\nOptions:\nA. 30%\nB. 60%\nC. 10%\nD. 90%\nE. 80%\nF. 40%\nG. 100%\nH. 50%\nI. N/A\nJ. N/A\nAnswer: Let's think step by step. E\n\nQuestion:\nWhich one of the following items is an example of nonmaterial culture?\nOptions:\nA. A dove feather\nB. Dove symbol\nC. Dove body lotion\nD. Dove deodorant\nE. Dove soap\nF. Dove candy bar\nG. Dove conditioner\nH. A dove (bird).\nI. Dove chocolate\nJ. Dove shampoo\nAnswer: Let's think step by step. B\n\nQuestion:\nWhich of the following cases established the precedent that a defendant must be informed of the right to remain silent, the right to a lawyer, and protection from self-incrimination?\nOptions:\nA. Brown v. Board of Education\nB. Miranda v. Arizona\nC. Roe v. Wade\nD. Betts v. Brady\nE. Plessy v. Ferguson\nF. Dred Scott v. Sandford\nG. Weeks v. United States\nH. Gideon v. Wainwright\nI. Marbury v. Madison\nJ. Mapp v. Ohio\nAnswer: Let's think step by step. B\n\nQuestion:

As the sample shows, the answer format in our fewshot samples is "Answer: Let's think step by step. E", so the model might not output its answer in the "answer is ([A-J])" format. Therefore, I think we should correct the fewshot context construction as per #2196.
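A minimal sketch of the mismatch described here. The pattern is an assumption modelled on the "answer is ([A-J])" format quoted in this comment, and `extract_answer` is a hypothetical helper, not the harness's own filter. The `[invalid]` fallback mirrors the value seen in `filtered_resps`.

```python
import re

# Assumed pattern, modelled on the "answer is ([A-J])" format quoted above;
# the pattern in the task's actual config may differ.
ANSWER_RE = re.compile(r"answer is \(?([A-J])\)?")
FALLBACK = "[invalid]"  # placeholder value seen in filtered_resps


def extract_answer(response: str) -> str:
    """Return the first letter choice found, or the fallback string."""
    match = ANSWER_RE.search(response)
    return match.group(1) if match else FALLBACK


# A response that ends the way the instruction asks.
cot_style = "Shoot tips are areas of rapid growth.\n\nThe answer is (B)."
# A response that imitates the current fewshot format.
fewshot_style = "Answer: Let's think step by step. E"

print(extract_answer(cot_style))
print(extract_answer(fewshot_style))
```

Under this assumption, a model that copies the fewshot format would never be matched, while a response ending in "The answer is (B)." would be. If such a response still yields `[invalid]`, it may be worth checking which pattern the harness actually loaded.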

eyuansu62 commented 2 months ago

@liewziqin In the latest code, the input prompt seems to contain "the answer is xxx".

liewziqin commented 2 months ago

@eyuansu62 I see, thanks for letting me know. I haven't tried the latest code yet.

eyuansu62 commented 2 months ago

However, the latest code generates an unexpected blank space between sections, like this: 'The answer is (D).\n\n Question:'. I believe it would be better if there were no blank space between '\n\n' and 'Question'.
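One way to avoid the stray space, sketched under assumptions: the example strings and the `build_context` helper are hypothetical and are not the harness's own context builder. The idea is to strip each fewshot block before joining on the delimiter.

```python
# Hypothetical fewshot blocks; the second has the stray leading space.
examples = [
    "Question:\nQ1\nAnswer: Let's think step by step. The answer is (D).",
    " Question:\nQ2\nAnswer: Let's think step by step. The answer is (B).",
]


def build_context(blocks, delimiter="\n\n"):
    """Join fewshot blocks so that no whitespace follows the delimiter."""
    # Strip each block first, then join on the delimiter alone.
    return delimiter.join(block.strip() for block in blocks)


context = build_context(examples)
print(repr(context))
```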

lintangsutawika commented 2 months ago

I've gone ahead and merged #2238 which also includes a fix related to the blank space after \n\n.