EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Improvements to MGSM #1391

Open leocnj opened 5 months ago

leocnj commented 5 months ago

https://github.com/EleutherAI/lm-evaluation-harness/blob/7411947112117e0339fe207fb620a70bcec22690/lm_eval/tasks/mgsm/direct/mgsm_direct_en.yaml#L3

According to my understanding, for the target we need to obtain a substring of answer starting at location 6+1. The existing answer[6+1], however, only returns the single character at location 7. Is this a bug?
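A minimal sketch of the difference (plain Python slicing semantics; the example answer string here is made up for illustration):

```python
# Illustrative only: indexing with [7] yields one character,
# slicing with [7:] yields the rest of the string.
answer = "Answer:Roger started with 5 balls."

single_char = answer[6 + 1]   # the character at index 7
substring = answer[6 + 1:]    # everything from index 7 onward

print(single_char)  # -> R
print(substring)    # -> Roger started with 5 balls.
```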

baberabb commented 5 months ago

Hi! @juletx should be able to confirm, but I think just using {{answer_number|string}} without the condition should work here. Not quite sure what we are indexing here. The CoT prompts also seem to be set incorrectly.

answer:
Step-by-Step Answer: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
answer_number:
11
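For reference, Jinja's |string filter just casts the value to a string, so the suggested target template reduces to str() on the answer_number field. A minimal Python stand-in (illustrative sketch, not harness code):

```python
# What {{answer_number|string}} would render for the example doc above
# (Jinja's |string filter behaves like Python's str()).
doc = {
    "answer": "Step-by-Step Answer: Roger started with 5 balls. "
              "2 cans of 3 tennis balls each is 6 tennis balls. "
              "5 + 6 = 11. The answer is 11.",
    "answer_number": 11,
}

target = str(doc["answer_number"])
print(target)  # -> 11
```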
juletx commented 5 months ago

Hi @leocnj @baberabb! Not sure why it's implemented like that but I'd say it's a bug. None of the mgsm tasks generates the correct few-shot prompt when testing them with write_out with num_fewshot=5.

mgsm_direct_en: it should be Answer: answer_number

!!@@##@@!! -- Example 0
Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Answer-

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Answer-

Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
Answer-

Question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
Answer-

Question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
Answer-

Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Answer

mgsm_en_cot_en: Step-by-Step Answer should contain the full answer text

!!@@##@@!! -- Example 0
Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Step-by-Step Answer: T

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Step-by-Step Answer: R

Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
Step-by-Step Answer: J

Question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
Step-by-Step Answer: 5

Question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
Step-by-Step Answer: M

Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Step-by-Step Answer:

mgsm_native_cot_en: Step-by-Step Answer should contain the full answer text

!!@@##@@!! -- Example 0
Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Step-by-Step Answer: T

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Step-by-Step Answer: R

Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
Step-by-Step Answer: J

Question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
Step-by-Step Answer: 5

Question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
Step-by-Step Answer: M

Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Step-by-Step Answer:
baberabb commented 5 months ago

Thanks for the confirmation @juletx! Also looks like the \nAnswer: string in doc_to_text should be in the native language for the direct variation, which doesn't seem to be true for the majority of the languages.

leocnj commented 5 months ago

I did some more reading on why [6+1] was used.

Inside mgsm/utils.py, we can find

"en": {  # English
        "QUESTION": "Question:",
        "ANSWER": "Step-by-Step Answer:",
        "DIRECT": "Answer:",
        "REGEX": "The answer is (\\-?[0-9\\.\\,]+)",

...

yaml.dump(
    {
        "include": yaml_template,
        "dataset_name": lang,
        "task": f"mgsm_{lang}_direct",
        "doc_to_text": f"""{{% if answer is not none %}}"""
        f"""{{{{question+"\\n{ANSWER}"}}}}"""
        f"""{{% else %}}"""
        f"""{{{{"{QUESTION} "+question+"\\n{ANSWER}"}}}}"""
        f"""{{% endif %}}""",
        "doc_to_target": f"""{{% if answer is not none %}}"""
        f"""{{{{answer[{len(ANSWER)}+1]}}}}"""
        f"""{{% else %}}"""
        f"""{{{{answer_number|string}}}}"""
        f"""{{% endif %}}""",
        **filter_list,
    },

This generates a YAML file like:

# Generated by utils.py
dataset_name: en
doc_to_target: '{% if answer is not none %}{{answer[6+1]}}{% else %}{{answer_number|string}}{%
  endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer"}}{% else %}{{"Question:
  "+question+"\nAnswer"}}{% endif %}'
include: direct_yaml
task: mgsm_direct_en
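As a side note, the quadruple braces in those f-strings expand into literal Jinja braces. A standalone sketch of the expansion (assuming ANSWER was "Answer", which would explain the generated 6+1, since len("Answer") == 6):

```python
# Illustrative expansion of the f-string template from utils.py.
# {{{{ in an f-string becomes a literal {{ ; {len(ANSWER)} becomes 6.
ANSWER = "Answer"  # assumption: a 6-char marker, matching the generated 6+1

doc_to_target = (
    "{% if answer is not none %}"
    f"{{{{answer[{len(ANSWER)}+1]}}}}"
    "{% else %}"
    "{{answer_number|string}}"
    "{% endif %}"
)

print(doc_to_target)
# -> {% if answer is not none %}{{answer[6+1]}}{% else %}{{answer_number|string}}{% endif %}
```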

It looks like answer[6+1] intends to skip the pre-defined ANSWER string, "Step-by-Step Answer:", which has a len of 6, from the answer string. However, for that purpose we need to use answer[6+1:]

baberabb commented 5 months ago

It looks like answer[6+1] intends to skip the pre-defined ANSWER string, "Step-by-Step Answer:", which has a len of 6, from the answer string. However, for that purpose we need to use answer[6+1:]

aah that makes more sense. "Step-by-Step Answer:" has 20 characters by my count though ("answer" has 6).
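For the record, a quick check of those lengths (illustrative):

```python
# Character counts of the marker strings under discussion.
print(len("answer"))                # -> 6
print(len("Answer:"))               # -> 7
print(len("Step-by-Step Answer:"))  # -> 20
# So skipping "Step-by-Step Answer: " (marker plus the trailing space)
# would need answer[21:], not answer[6+1:].
```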

leocnj commented 4 months ago

Made a PR to fix the issues we observed above. Now the generated contexts look normal.

For example, for mgsm_direct_en, the context generated with the new yaml is:

!!@@##@@!! -- Example 1
Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Answer:Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Answer:There are 3 cars in the beginning, 2 more arrive, so now there should be 3 + 2 = 5 cars. The answer is 5.

Question: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?
Answer:

For mgsm_native_cot_zh, the context generated with the new yaml is:

!!@@##@@!! -- Example 1
问题:罗杰有 5 个网球。他又买了 2 罐网球。每罐有 3 个网球。他现在有多少个网球?
逐步解答: 杰一开始有 5 个球。2 罐各 3 个网球就是 6 个网球。5 + 6 = 11。答案是 11。

问题:如果停车场里有 3 辆车,又来了 2 辆车,停车场里有多少辆车?
逐步解答: 开始有 3 辆车,又来了 2 辆,所以现在应该有 3 + 2 = 5 辆车。答案是 5。

问题:服务器机房里有九台电脑。从周一到周四,每天又安装了五台电脑。服务器机房里现在有多少台电脑?
逐步解答:
nitsanluke commented 4 months ago

The filter on the native-language tasks also needs updating; it currently uses the English format for the answer. https://github.com/EleutherAI/lm-evaluation-harness/blob/a72babbfbddd9195748351892dced4f82fccbc0d/lm_eval/tasks/mgsm/native_cot/cot_yaml#L28

Hugging Face dataset French few-shot example:

Question : Roger a 5 balles de tennis. Il achète 2 autres boîtes de balles de tennis en plus. Si chaque boîte contient 3 balles de tennis, combien de balles de tennis a-t-il maintenant ?
Réponse étape par étape : Roger a commencé avec 5 balles. 2 boîtes de 3 balles de tennis chacune représentent 6 balles de tennis. 5 + 6 = 11. La réponse est 11.
answer_number: 11
equation_solution: 5 + 6 = 11.

https://huggingface.co/datasets/juletxara/mgsm/viewer/fr?row=0

haileyschoelkopf commented 4 months ago

Agreed -- just taking the final number from the response (as in https://github.com/EleutherAI/lm-evaluation-harness/blob/a72babbfbddd9195748351892dced4f82fccbc0d/lm_eval/tasks/gsm8k/gsm8k-cot.yaml#L43-L48) would already be an improvement as well.
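A rough sketch of such a "take the last number" filter (the regex here is an assumption, loosely modeled on the linked gsm8k-cot config, not copied from it):

```python
import re

def last_number(text):
    """Return the final number in a model response, language-independently.

    Illustrative sketch: match digit runs (optionally signed, with , and .
    inside), take the last one, strip trailing punctuation and thousands
    separators.
    """
    matches = re.findall(r"-?[0-9][0-9.,]*", text)
    if not matches:
        return None
    return matches[-1].rstrip(".,").replace(",", "")

print(last_number("5 + 6 = 11. The answer is 11."))  # -> 11
print(last_number("La réponse est 11."))             # -> 11
```

Because it keys on the trailing number rather than a fixed English phrase like "The answer is", the same filter would work for every language variant.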

naiarapm commented 2 months ago

Hi! Shouldn't doc_to_target simply be {{answer_number|string}} in the direct task variants? Right now, all in-context learning examples include the step-by-step answers, even in the direct setup. If I understood correctly, that was not the intended setup in the original article.

Native CoT

doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Question: "+question+"\nStep-by-Step Answer:"}}{% endif %}'

Example

!!@@##@@!! -- Example 0
Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Step-by-Step Answer: There are 3 cars in the beginning, 2 more arrive, so now there should be 3 + 2 = 5 cars. The answer is 5.

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Step-by-Step Answer:

✅ Independently of whether the evaluation is zero- or few-shot, the model is asked to generate a chain of thought.
✅ In few-shot, in-context examples will contain step-by-step answers.

Direct

doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}'
doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}'

Example:

!!@@##@@!! -- Example 0
Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Answer:There are 3 cars in the beginning, 2 more arrive, so now there should be 3 + 2 = 5 cars. The answer is 5.

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Answer:

✅ Independently of whether the evaluation is zero- or few-shot, the model is asked to directly output the answer.
❓ In few-shot, in-context examples will contain step-by-step answers. Won't models then tend to also generate a CoT?

As it is now, the only difference between these two setups seems to be that the prompt says "Step-by-Step Answer:" for CoT, and "Answer:" for direct. Am I missing something? Thank you!