Closed kyleliang919 closed 9 months ago
Upon closer inspection, it seems that every target somehow gets an extra space added to it.
Another, similar issue is with GSM8k itself: the target is not being parsed properly, and this has nothing to do with the model's generation. As you can see below, the target is not parsed at all. Command to replicate:
lm_eval --model openai-chat-completions --model_args model=gpt-3.5-turbo --tasks gsm8k --log_samples --output_path results/gpt_3.5_turbo_gsm8k --limit 0.01
{
"doc_id": 2,
"doc": {
"question": "Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?",
"answer": "The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000\nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000\nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000\nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n#### 70000"
},
"target": "The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000\nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000\nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000\nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n#### 70000",
"arguments": [
[
"Question: Tom can type 90 words a minute. A page is 450 words. How long would it take him to type out 10 pages?\nAnswer: He can type out 1 page in 450/90=<<450/90=5>>5 minutes\nSo it would take 5*10=<<5*10=50>>50 minutes\n#### 50\n\nQuestion: Stephanie is decorating 24 cupcakes for a birthday party, but she needs more candles. She currently has a total of 30 candles. She wants to decorate half of the cupcakes with 1 candle each and the other half of the cupcakes with 2 candles each. How many additional candles does Stephanie need to complete the cupcakes?\nAnswer: For half of the cupcakes, Stephanie wants to use 1 candle each. Since half of the cupcakes is 24/2 and she plans to use 1 candle each for this half of the cupcakes, Stephanie needs (24/2)*1 = <<24/2*1=12>>12 candles for this half of the cupcakes.\nFor the other half of the cupcakes, Stephanie wants to use 2 candles. Therefore, she will need (24/2)*2 =<<24/2*2=24>>24 candles for this half of the cupcakes.\nBecause Stephanie needs 12 candles for half of the cupcakes and 24 candles for the other half, she needs a total of 12+24=<<12+24=36>>36 candles.\nSince Stephanie needs 36 candles to decorate all the cupcakes and she currently has 30 candles, Stephanie needs 36-30= <<36-30=6>>6 additional candles.\n#### 6\n\nQuestion: A porcelain vase was originally priced at $200 but went on sale for 25% off. If Donna bought the porcelain vase and paid 10% sales tax, how much did she pay in total?\nAnswer: Donna got $200 x 25/100 = $<<200*25/100=50>>50 less for the porcelain vase.\nSo, the porcelain vase is now $200 - $50 = $<<200-50=150>>150.\nDonna has to pay $150 x 10/100 = $<<150*10/100=15>>15 for the sales tax.\nTherefore, she paid $150 + $15 = $<<150+15=165>>165 in total.\n#### 165\n\nQuestion: Michael makes birdhouses to sell at craft shows. He charges $22 for each large birdhouse, $16 for each medium birdhouse, and $7 for each small birdhouse. This week, he sold 2 large birdhouses, 2 medium birdhouses, and 3 small birdhouses. How much money, in dollars, did he make this week?\nAnswer: Michael sold 2 large birdhouses for $22 each, so he made 2*$22= $<<2*22=44>>44 from large birdhouse sales.\nMichael also sold 2 medium birdhouses for $16 each, so he made 2*$16= $<<2*16=32>>32 from medium birdhouse sales.\nMichael sold 3 small birdhouses for $7 each, so he made 3*7=$<<3*7=21>>21 from small birdhouse sales.\nSince Michael made $44 from large birdhouse sales, $32 from medium birdhouse sales, and $21 for small birdhouse sales, he made $44+$32+$21= $<<44+32+21=97>>97 total this week.\n#### 97\n\nQuestion: A pen is longer than the rubber by 3 centimeters, and shorter than the pencil by 2 centimeters. Considering the pencil is 12 centimeters long, what is the total length of the rubber, the pen, and the pencil altogether?\nAnswer: The pen is shorter than the pencil by 2 centimeters, so it's 12 - 2 = <<12-2=10>>10 centimeters long.\nThat means the rubber is 10 - 3 = <<10-3=7>>7 centimeters long\nSo all accessories are 7 + 10 + 12 = <<7+10+12=29>>29 centimeters long altogether.\n#### 29\n\nQuestion: Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?\nAnswer:",
{
"until": [
"\n\n",
"Question:"
],
"do_sample": false,
"temperature": 0.0
}
]
],
"resps": [
[
"The increase in value of the house is 150% of $80,000, which is $80,000 * 150/100 = $<<80000*150/100=120000>>120,000.\nSo the total value of the house after repairs is $80,000 + $120,000 = $<<80000+120000=200000>>200,000.\nTherefore, Josh made a profit of $200,000 - $130,000 = $<<200000-130000=70000>>70,000. Answer: \\boxed{70,000}."
]
],
"filtered_resps": [
"[invalid]"
],
"exact_match": 0.0
},
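For comparison, here is a minimal sketch of what a parsed target would look like for this sample, assuming the convention visible in the `answer` field that the final answer follows the `####` marker. The helper name `extract_target` is hypothetical and is not the harness's own code.

```python
def extract_target(answer: str) -> str:
    """Return the final answer that follows the last '####' marker.

    Hypothetical helper for illustration; not part of lm_eval.
    """
    marker = "####"
    if marker not in answer:
        raise ValueError("no '####' marker found in answer")
    # Keep only the text after the last marker and drop thousands separators.
    final = answer.rsplit(marker, 1)[1].strip()
    return final.replace(",", "")


# Last line of the logged "answer" field from the sample above.
logged_answer = (
    "So he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n"
    "#### 70000"
)
print(extract_target(logged_answer))
```

With a target parsed this way, `"target"` in the log would be `70000` rather than the full worked solution.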
Thanks for reporting this!
For the Chain-of-Thought GSM8k setting, this looks like a bug, and I can look into it.
For the latter example: the answer checking we currently use is fairly stringent across exact-match tasks. Loosening it to catch things such as GPT-3.5 responding in the "wrong" boxed format is on our to-do list, and in future we'll likely report both an "exact" / "stringent" accuracy and a looser, more flexible accuracy metric.
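To illustrate why a stringent check rejects the boxed response, here is a rough sketch. It assumes a strict filter that only accepts a number directly following `#### ` and falls back to `[invalid]` otherwise. The pattern and the names `strict_filter` and `exact_match` are assumptions made for illustration, not the harness's confirmed configuration.

```python
import re

# Assumed strict pattern: a number that directly follows "#### ".
STRICT_PATTERN = re.compile(r"#### (-?[0-9.,]+)")
FALLBACK = "[invalid]"


def strict_filter(response: str) -> str:
    """Return the number after '#### ', or the fallback if none is found."""
    match = STRICT_PATTERN.search(response)
    return match.group(1) if match else FALLBACK


def exact_match(prediction: str, target: str) -> float:
    """Score 1.0 only when the two strings are identical."""
    return 1.0 if prediction == target else 0.0


# Tail of the logged response: correct value, but in \boxed{} format.
boxed = "Therefore, Josh made a profit of $70,000. Answer: \\boxed{70,000}."
filtered = strict_filter(boxed)
print(filtered, exact_match(filtered, "70000"))
```

Under these assumptions the boxed response never reaches the comparison with a usable value, which matches the `[invalid]` entry in `filtered_resps` above.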
I'm away this week for the holidays, but I'm going to heavily prioritize doing this pass and adding this more flexible answer checking as soon as I'm back. Sorry that this is causing confusion!
@haileyschoelkopf the gsm8k issue is what I was asking about on Discord the other day.
Yup, the \boxed{} issue right?
We may not be able to catch all the edge cases, but for gsm8k we could, for example, take the last number from the model's output and score that.
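A minimal sketch of that idea, assuming answers are plain integers or decimals with optional thousands separators. The names `last_number` and `flexible_match` are hypothetical, and this is only one possible way to do the looser scoring.

```python
import re
from typing import Optional

# A number with optional sign, thousands separators, and decimal part.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[str]:
    """Return the last number in the text with commas removed, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def flexible_match(response: str, target: str) -> float:
    """Score 1.0 when the last numbers in response and target are equal."""
    pred = last_number(response)
    gold = last_number(target)
    if pred is None or gold is None:
        return 0.0
    return 1.0 if float(pred) == float(gold) else 0.0


response = (
    "Therefore, Josh made a profit of $200,000 - $130,000 = "
    "$<<200000-130000=70000>>70,000. Answer: \\boxed{70,000}."
)
print(last_number(response), flexible_match(response, "#### 70000"))
```

Comparing as floats means `70,000`, `70000`, and `70000.0` all count as the same answer. Known gaps in this sketch include fractions, percentages, and responses that mention another number after the final answer.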
Command to replicate:
Example of one of the mislabeled test entries: the answer was correct, but exact match outputs 0.0.