EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License
6.84k stars 1.83k forks

GSM8k_CoT or Exact Match infra seems to be broken #1204

Closed: kyleliang919 closed this issue 9 months ago

kyleliang919 commented 10 months ago

Command to replicate:

lm_eval --model openai-chat-completions --model_args model=gpt-3.5-turbo --tasks gsm8k_cot --log_samples --output_path results/gpt_3.5_turbo_gsm8k_cot

Example of one of the mislabeled test entries: the answer was correct, but exact_match is 0.0.

{
    "doc_id": 3,
    "doc": {
      "question": "James decides to run 3 sprints 3 times a week.  He runs 60 meters each sprint.  How many total meters does he run a week?",
      "answer": "He sprints 3*3=<<3*3=9>>9 times\nSo he runs 9*60=<<9*60=540>>540 meters\n#### 540"
    },
    "target": " 540",
    "arguments": [
      [
        "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\n\nA: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.\n\nQ: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\n\nA: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.\n\nQ: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\n\nA: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.\n\nQ: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?\n\nA: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.\n\nQ: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?\n\nA: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.\n\nQ: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?\n\nA: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is 29.\n\nQ: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?\n\nA: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer is 33.\n\nQ: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\n\nA: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.\n\nQ: James decides to run 3 sprints 3 times a week.  He runs 60 meters each sprint.  How many total meters does he run a week?\n\nA:",
        {
          "temperature": 0.0
        }
      ]
    ],
    "resps": [
      [
        " James runs 3 sprints, 3 times a week. Each sprint is 60 meters. So 3 x 3 = 9 sprints. 9 x 60 = 540 meters. The answer is 540 meters."
      ]
    ],
    "filtered_resps": [
      "540"
    ],
    "exact_match": 0.0
  }
kyleliang919 commented 10 months ago

Upon closer inspection, every target somehow has an extra leading space added.
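A minimal sketch of the mismatch, using the `filtered_resps` and `target` values from the logged sample above. The `exact_match` helper here is hypothetical and only illustrates the string comparison; it is not the harness's metric code.

```python
def exact_match(prediction: str, target: str, strip: bool = False) -> float:
    """Return 1.0 if the two strings are identical, else 0.0.

    Hypothetical helper for illustration only; not the harness's implementation.
    """
    if strip:
        # Remove surrounding whitespace from both sides before comparing.
        prediction, target = prediction.strip(), target.strip()
    return 1.0 if prediction == target else 0.0


filtered_resp = "540"  # "filtered_resps" in the logged sample
target = " 540"        # "target" in the logged sample (note the leading space)

strict_score = exact_match(filtered_resp, target)
stripped_score = exact_match(filtered_resp, target, strip=True)
print(strict_score, stripped_score)
```

With the leading space left in place the comparison fails even though the extracted answer is right; stripping whitespace on both sides makes the two strings equal.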

kyleliang919 commented 10 months ago

Another similar issue is with GSM8k itself, where the target is not parsed properly; this has nothing to do with the model's generation. As you can see below, the target is not parsed at all. Command to replicate:

lm_eval --model openai-chat-completions --model_args model=gpt-3.5-turbo --tasks gsm8k --log_samples --output_path results/gpt_3.5_turbo_gsm8k --limit 0.01
{
    "doc_id": 2,
    "doc": {
      "question": "Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?",
      "answer": "The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000\nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000\nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000\nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n#### 70000"
    },
    "target": "The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000\nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000\nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000\nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n#### 70000",
    "arguments": [
      [
        "Question: Tom can type 90 words a minute.  A page is 450 words.  How long would it take him to type out 10 pages?\nAnswer: He can type out 1 page in 450/90=<<450/90=5>>5 minutes\nSo it would take 5*10=<<5*10=50>>50 minutes\n#### 50\n\nQuestion: Stephanie is decorating 24 cupcakes for a birthday party, but she needs more candles. She currently has a total of 30 candles. She wants to decorate half of the cupcakes with 1 candle each and the other half of the cupcakes with 2 candles each. How many additional candles does Stephanie need to complete the cupcakes?\nAnswer: For half of the cupcakes, Stephanie wants to use 1 candle each. Since half of the cupcakes is 24/2 and she plans to use 1 candle each for this half of the cupcakes, Stephanie needs (24/2)*1 = <<24/2*1=12>>12 candles for this half of the cupcakes.\nFor the other half of the cupcakes, Stephanie wants to use 2 candles. Therefore, she will need (24/2)*2 =<<24/2*2=24>>24 candles for this half of the cupcakes.\nBecause Stephanie needs 12 candles for half of the cupcakes and 24 candles for the other half, she needs a total of 12+24=<<12+24=36>>36 candles.\nSince Stephanie needs 36 candles to decorate all the cupcakes and she currently has 30 candles, Stephanie needs 36-30= <<36-30=6>>6 additional candles.\n#### 6\n\nQuestion: A porcelain vase was originally priced at $200 but went on sale for 25% off. If Donna bought the porcelain vase and paid 10% sales tax, how much did she pay in total?\nAnswer: Donna got $200 x 25/100 = $<<200*25/100=50>>50 less for the porcelain vase.\nSo, the porcelain vase is now $200 - $50 = $<<200-50=150>>150.\nDonna has to pay $150 x 10/100 = $<<150*10/100=15>>15 for the sales tax.\nTherefore, she paid $150 + $15 = $<<150+15=165>>165 in total.\n#### 165\n\nQuestion: Michael makes birdhouses to sell at craft shows. He charges $22 for each large birdhouse, $16 for each medium birdhouse, and $7 for each small birdhouse. This week, he sold 2 large birdhouses, 2 medium birdhouses, and 3 small birdhouses. How much money, in dollars, did he make this week?\nAnswer: Michael sold 2 large birdhouses for $22 each, so he made 2*$22= $<<2*22=44>>44 from large birdhouse sales.\nMichael also sold 2 medium birdhouses for $16 each, so he made 2*$16= $<<2*16=32>>32 from medium birdhouse sales.\nMichael sold 3 small birdhouses for $7 each, so he made 3*7=$<<3*7=21>>21 from small birdhouse sales.\nSince Michael made $44 from large birdhouse sales, $32 from medium birdhouse sales, and $21 for small birdhouse sales, he made $44+$32+$21= $<<44+32+21=97>>97 total this week.\n#### 97\n\nQuestion: A pen is longer than the rubber by 3 centimeters, and shorter than the pencil by 2 centimeters. Considering the pencil is 12 centimeters long, what is the total length of the rubber, the pen, and the pencil altogether?\nAnswer: The pen is shorter than the pencil by 2 centimeters, so it's 12 - 2 = <<12-2=10>>10 centimeters long.\nThat means the rubber is 10 - 3 = <<10-3=7>>7 centimeters long\nSo all accessories are 7 + 10 + 12 = <<7+10+12=29>>29 centimeters long altogether.\n#### 29\n\nQuestion: Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?\nAnswer:",
        {
          "until": [
            "\n\n",
            "Question:"
          ],
          "do_sample": false,
          "temperature": 0.0
        }
      ]
    ],
    "resps": [
      [
        "The increase in value of the house is 150% of $80,000, which is $80,000 * 150/100 = $<<80000*150/100=120000>>120,000.\nSo the total value of the house after repairs is $80,000 + $120,000 = $<<80000+120000=200000>>200,000.\nTherefore, Josh made a profit of $200,000 - $130,000 = $<<200000-130000=70000>>70,000. Answer: \\boxed{70,000}."
      ]
    ],
    "filtered_resps": [
      "[invalid]"
    ],
    "exact_match": 0.0
  },
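For reference, here is a rough sketch of what parsing the target could look like, assuming the convention visible in the sample above that a GSM8k reference answer ends with a `#### <number>` line. The function name and regex are illustrative assumptions, not the harness's own filter configuration.

```python
import re
from typing import Optional

# GSM8k reference answers end with a line of the form "#### <number>".
FINAL_ANSWER_RE = re.compile(r"####\s*(-?[0-9][0-9,]*(?:\.[0-9]+)?)")


def extract_gold_answer(answer: str) -> Optional[str]:
    """Return the number after '####' with thousands separators removed.

    Hypothetical helper; returns None when no '####' marker is present.
    """
    match = FINAL_ANSWER_RE.search(answer)
    if match is None:
        return None
    return match.group(1).replace(",", "")


# Last two lines of the "answer" field from the sample above.
answer = (
    "So he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n"
    "#### 70000"
)
gold = extract_gold_answer(answer)
print(gold)
```

Applied to the sample, this would reduce the multi-line `target` to just the final number, which is what the model's extracted answer would need to be compared against.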
haileyschoelkopf commented 10 months ago

Thanks for reporting this!

For the Chain-of-Thought GSM8k setting, this looks like a bug, and I can look into it.

For the latter example: the answer checking we currently use is fairly stringent across exact-match tasks. Loosening it to catch things such as GPT-3.5 responding in the "wrong" boxed format is on our to-do list, and in future we'll likely report both an "exact" / "stringent" accuracy and a looser, more flexible accuracy metric.

haileyschoelkopf commented 10 months ago

I'm away this week for the holidays, but I'm going to heavily prioritize doing this pass and adding this more flexible answer checking as soon as I'm back. Sorry that this is causing confusion!

anjor commented 10 months ago

@haileyschoelkopf the gsm8k issue is what I was asking about on Discord the other day.

haileyschoelkopf commented 10 months ago

Yup, the \boxed{} issue, right?

We may not be able to catch all the edge cases, but for gsm8k we could, for example, take the last number from the model's output and score that.
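A minimal sketch of that "last number" idea, tried on the two responses logged earlier in this thread. The regex and the `last_number` helper are assumptions made for illustration, not the harness's eventual implementation, and they will miss edge cases such as fractions or numbers written out in words.

```python
import re
from typing import Optional

# An optional sign, digits with optional thousands separators, optional decimals.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[str]:
    """Return the last number found in text, with commas removed.

    Hypothetical helper; returns None when the text contains no digits.
    """
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


# Tail of the gsm8k_cot response from the first sample.
cot_resp = " So 3 x 3 = 9 sprints. 9 x 60 = 540 meters. The answer is 540 meters."
# Tail of the gsm8k response that used the \boxed{} format.
boxed_resp = "Therefore, Josh made a profit of $200,000 - $130,000 = $70,000. Answer: \\boxed{70,000}."

cot_answer = last_number(cot_resp)
boxed_answer = last_number(boxed_resp)
print(cot_answer, boxed_answer)
```

Both responses yield the expected final answer under this rule, including the `\boxed{}` case that the stricter pattern marked as `[invalid]`.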