fgenie / rims_minimal

시작이 절반이고 마무리 또한 절반이다.
0 stars 1 forks source link

MATH 결과 #13

Closed seanexp closed 8 months ago

seanexp commented 8 months ago

baseline, chatgpt

total: 459 / 871
fail: 0 / 871
nonconflict: 13 / 67
conflict: 446 / 804

rims, chatgpt

total: 433 / 866
fail: 0 / 866
nonconflict: 15 / 68
conflict: 418 / 798

baseline, gpt4

total: 598 / 871
fail: 0 / 871
nonconflict: 13 / 42
conflict: 585 / 829

rims, gpt4

total: 556 / 868
fail: 0 / 868
nonconflict: 14 / 45
conflict: 542 / 823
seanexp commented 8 months ago

rims 를 돌릴 때 이런 샘플들이 생깁니다.

'index'
Object of type KeyError is not JSON serializable
'index'
Object of type KeyError is not JSON serializable
'index'
Object of type KeyError is not JSON serializable
fgenie commented 8 months ago

1,2 에 관해 해결한 후 반영하겠습니다. @seanexp

1) MATH 결과 무결성: buggy conflict decision --> buggy results

며칠전에 추가해주신 evaluation의 sympy equivalence가 a1, a2 비교시에 둘의 타입을 string으로 전제하는 것 같습니다. 저희 execution결과는 float, str, NoneType이라서 conflict가 아니어야하는 상황에도 conflict처리가 된 것으로 보입니다. (get_concordant_answer() 함수에서도 해당 equivalence를 사용중)

REPL: is_equiv() 확인

>>> from utils.math_util import *
>>> a1,a2 = "\\frac{1}{2}", 0.5
>>> is_equiv(normalize_final_answer(a1), normalize_final_answer("0.5"))
True
>>> normalize_final_answer(a1)
'\\frac{1}{2}'
>>> is_equiv(normalize_final_answer(a1), 0.5)
False # unexpected
# `is_equiv` expects string (normalized Tex) inputs as answers
>>> is_equiv(normalize_final_answer(a1), normalize_final_answer(0.5)) 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/seonils/dev/rims_minimal/src/utils/math_util.py", line 138, in normalize_final_answer
    final_answer = final_answer.split("=")[-1]
                   ^^^^^^^^^^^^^^^^^^
AttributeError: 'float' object has no attribute 'split'
>>> is_equiv(normalize_final_answer(a1), "0.5")
True
>>> is_equiv(0.5, 0.5)
False # false negative
>>> is_equiv("0.5", "0.5")
True # expected
>>> is_equiv(a1, "0.5")
True # expected
>>> is_equiv(a1, "0.49999")
False #expected

chatgpt_01_21_20_51_MATH-prealgebra_rims_startidx0.jsonl:L3

{
  "question": "What is the least common multiple of the first ten positive integers?",
  "level": "Level 4",
  "type": "Prealgebra",
  "solution": "Since $1$ divides every integer, we need only find the least common multiple (LCM) of the numbers $2$ through $10$.  Their prime factorizations are, respectively, $2, 3, 2^2, 5, 2\\cdot 3, 7, 2^3, 3^2, 2\\cdot 5$.   In its own prime factorization, the LCM must have each prime that appears in this list, raised to at minimum the highest power that appears in the list for that prime, in order for it to divide all of the integers from $2$ through $10$.  Thus, the prime factorization of the LCM is $2^3\\cdot 3^2 \\cdot 5\\cdot 7$. So the LCM is $8\\cdot 9\\cdot 5\\cdot 7 = \\boxed{2520}$.",
  "answer": "2520",
  "ansmap": {
    "cot": 2520, //세 정답이 같으므로 이것이 majority_ans로 제출되고 non-conflict 케이스로 처리되어야 함
    "pal": 2520,
    "p2c": 2520
  },
  "solmap": {
    "cot": "To find the least common multiple (LCM) of the first ten positive integers, we can list out the prime factorization of each number and then take the highest power of each prime factor.\n\nThe prime factorization of the first ten positive integers is as follows:\n1 = 1\n2 = 2\n3 = 3\n4 = 2^2\n5 = 5\n6 = 2 * 3\n7 = 7\n8 = 2^3\n9 = 3^2\n10 = 2 * 5\n\nNow, we take the highest power of each prime factor:\n2^3 * 3^2 * 5 * 7 = 8 * 9 * 5 * 7 = 2520\n\nTherefore, the least common multiple of the first ten positive integers is 2520.",
    "pal": "import math\n\ndef solution():\n    \"\"\"What is the least common multiple of the first ten positive integers?\"\"\"\n    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n    lcm = numbers[0]\n    for i in range(1, len(numbers)):\n        lcm = lcm * numbers[i] // math.gcd(lcm, numbers[i])\n    result = lcm\n    return result",
    "p2c": [
      "1. Start by listing the first ten positive integers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.\n2. Identify the prime factors of each number:\n   - 1: No prime factors (1 is not a prime number).\n   - 2: Prime factor is 2.\n   - 3: Prime factor is 3.\n   - 4: Prime factors are 2 and 2.\n   - 5: Prime factor is 5.\n   - 6: Prime factors are 2 and 3.\n   - 7: Prime factor is 7.\n   - 8: Prime factors are 2, 2, and 2.\n   - 9: Prime factors are 3 and 3.\n   - 10: Prime factors are 2 and 5.\n3. Identify the highest power of each prime factor that appears in any of the numbers:\n   - 2: Appears with the highest power of 3 (2^3 = 8).\n   - 3: Appears with the highest power of 2 (3^2 = 9).\n   - 5: Appears with the highest power of 1 (5^1 = 5).\n   - 7: Appears with the highest power of 1 (7^1 = 7).\n4. Multiply the prime factors raised to their highest powers:\n   - 2^3 * 3^2 * 5^1 * 7^1 = 8 * 9 * 5 * 7 = 2520.\n5. The least common multiple (LCM) of the first ten positive integers is 2520. Return this value as the answer.\ndef solution():\n    lcm = 2**3 * 3**2 * 5**1 * 7**1\n    return lcm"
    ]
  },
  "selection_or_rims": { //"majority_vote": true 로 처리되어야 함. 여기서는 RIMS selection을 거쳤음.
    "good_solution": "To find the least common multiple (LCM) of the first ten positive integers, we can use the prime factorization method.\n\n1. Write down the prime factorization of each number from 1 to 10:\n   - 1 = 1\n   - 2 = 2\n   - 3 = 3\n   - 4 = 2^2\n   - 5 = 5\n   - 6 = 2 * 3\n   - 7 = 7\n   - 8 = 2^3\n   - 9 = 3^2\n   - 10 = 2 * 5\n\n2. Identify the highest power of each prime factor:\n   - 2: 2^3\n   - 3: 3^2\n   - 5: 5\n   - 7: 7\n\n3. Multiply the highest powers of each prime factor together:\n   LCM = 2^3 * 3^2 * 5 * 7\n\n4. Calculate the LCM:\n   LCM = 8 * 9 * 5 * 7\n   LCM = 2520",
    "good_ans": null,
    "good_method": "p2c",
    "bad_solutions": [],
    "bad_ans": [],
    "bad_method": [],
    "mistakes": [],
    "hint": [],
    "did_reflect": 0,
    "raw_query_out": "`Method`: Plan-and-then-Code (p2c)\n\n`Attempt 1`:\nTo find the least common multiple (LCM) of the first ten positive integers, we can use the prime factorization method.\n\n1. Write down the prime factorization of each number from 1 to 10:\n   - 1 = 1\n   - 2 = 2\n   - 3 = 3\n   - 4 = 2^2\n   - 5 = 5\n   - 6 = 2 * 3\n   - 7 = 7\n   - 8 = 2^3\n   - 9 = 3^2\n   - 10 = 2 * 5\n\n2. Identify the highest power of each prime factor:\n   - 2: 2^3\n   - 3: 3^2\n   - 5: 5\n   - 7: 7\n\n3. Multiply the highest powers of each prime factor together:\n   LCM = 2^3 * 3^2 * 5 * 7\n\n4. Calculate the LCM:\n   LCM = 8 * 9 * 5 * 7\n   LCM = 2520\n\n`Answer 1`: 2520\n",
    "query_msg": [
      {
        "role": "user",
        "content": "You are now solving math word problems. You brilliantly detects the errors in the wrong solution and find `Workaround Method` to correct the solution. The methods you are taking are as follows. Each has its strength and weakness:\n\n- Chain of Thought (cot): Solving problem with writing steps of reasoning in a natural language. Might help correct understanding of the problem but this could be weaker in precise computation.\n- Program-aided Language Modeling (pal): Using python language to reason and obtain an accurate answer to the given question, but this could be weaker in understanding the problem.\n- Plan-and-then-Code (p2c): When a question seems requiring amount of steps to reach the answer, write plans first for what to compute and write a python code to it for solving the problem. However if planning goes wrong, the code will also be wrong. If any steps of planning provided before programming, then it will be considered as Plan-and-then-Code.\n\nFollowings are the examples of correcting the wrong solutions with a `Workaround Method` based on diagnosis (`Mistakes`) and `Hint for a better Method choice`.\n\n\n\n\n`Question`: Nedy can eat 8 packs of crackers from Monday to Thursday. If Nedy ate twice the amount on Friday, how many crackers did Nedy eat in all?\n`Method`: Chain-of-Thought (cot)\n`Attempt 1`: Answer:\nNedy can eat 8 packs of crackers from Monday to Thursday.\nOn Friday, Nedy ate twice the amount, which means 2 * 8 = 16 packs of crackers on Friday.\nSo in total, Nedy ate 8 (from Monday to Thursday) + 16 (on Friday) = 24 packs of crackers.\nSo the answer is 24.\n`Answer 1`: 24.0\n`Evaluation`: Wrong\n`Mistakes`: The mistake in the original attempt is in misunderstanding the problem statement. It is stated that Nedy can eat 8 packs of crackers from Monday to Thursday, which suggests that Nedy eats 8 packs each day, not for the whole period from Monday to Thursday. The calculation was based on the incorrect assumption that the 8 packs were for the entire four days.\n`Hint for a better Method choice`: A program-aided language model or a planned coding approach would help carry out the calculations more systematically to avoid such misunderstandings and provide accurate arithmetic computations.\n`Workaround Method`: Plan-and-then-Code (p2c)\n`Attempt 2`: 1. Start with a variable `crackers_per_day` and set its value to 8 since Nedy can eat 8 packs of crackers each day.\n2. Calculate the total number of crackers eaten from Monday to Thursday by multiplying `crackers_per_day` by 4 (the number of days from Monday to Thursday). Store this calculation in a new variable `crackers_mon_to_thu`.\n3. Calculate the number of crackers eaten on Friday by multiplying `crackers_per_day` by 2 since it's twice the daily amount. Store this in another variable `crackers_friday`.\n4. Add `crackers_mon_to_thu` and `crackers_friday` together to get the total number of crackers eaten in the whole week. Store this in a variable `total_crackers`.\n5. Return the value of `total_crackers` as the answer.\ndef solution():\n    crackers_per_day = 8\n    crackers_mon_to_thu = crackers_per_day * 4\n    crackers_friday = crackers_per_day * 2\n    total_crackers = crackers_mon_to_thu + crackers_friday\n\n    return total_crackers\n\n`Answer 2`: 48.0\n`Evaluation`: Correct\n\n\n\n`Question`: Two white socks cost 25 cents more than a single brown sock. If two white socks cost 45 cents, how much would you pay for 15 brown socks?\n`Method`: Program-aided Language Modeling (pal)\n`Attempt 1`:\ndef solution():\n    cost_two_white_socks = 45  # cost in cents\n    cost_difference = 25  # cost difference in cents\n\n    # Since two white socks cost 25 cents more than a single brown sock,\n    # subtract the cost difference from the cost of two white socks to find the cost of one brown sock.\n    cost_one_brown_sock = cost_two_white_socks - cost_difference\n\n    # Find the cost for 15 brown socks by multiplying the cost of one brown sock by 15.\n    cost_fifteen_brown_socks = cost_one_brown_sock * 15\n    result = cost_fifteen_brown_socks\n    return result\n\n# Call the function and print the result\nprint(solution())\n\n`Answer 1`: 300.0\n`Evaluation`: Wrong\n`Mistakes`: The error in the first attempt is due to a misunderstanding of the problem statement. The cost of one brown sock should not be calculated by simply subtracting the cost difference from the cost of two white socks. Instead, the cost for two white socks should be divided by two to find the cost of one white sock first before finding the cost of a single brown sock.\n`Hint for a better Method choice`: Use Chain of Thought to clearly understand the relationship between the cost of brown and white socks before calculating the final cost for 15 brown socks. This will help avoid the pitfall of incorrect subtraction that led to the error in the initial attempt.\n`Workaround Method`: Chain-of-Thought (cot)\n`Attempt 2`: Answer:\nTwo white socks cost 45 cents.\nOne brown sock costs 25 cents less than two white socks.\nTherefore, the cost of one brown sock is 45 cents - 25 cents = 20 cents.\n\nIf one brown sock costs 20 cents, then 15 brown socks would cost:\n15 * 20 cents = 300 cents.\n\nSince 100 cents is equal to 1 dollar:\n300 cents is equal to 300 / 100 = $3.\n\nSo, you would pay $3 for 15 brown socks.\n`Answer 2`: 3.0\n`Evaluation`: Correct\n\n\n\n`Question`: Bonnie and Samuel went to the market together. Bonnie bought 8 apples. Samuel bought 20 more apples than Bonnie. Samuel then ate half of them and used 1/7 of them to make apple pie. How many apples does Samuel have left?\n`Method`: Program-aided Language Modeling (pal)\n`Attempt 1`:\ndef solution():\n    bonnie_apples = 8\n    samuel_apples_more = 20\n    samuel_apples_bought = bonnie_apples + samuel_apples_more\n    samuel_apples_after_eating_half = samuel_apples_bought / 2\n    apples_for_pie = samuel_apples_after_eating_half / 7\n    samuel_apples_left = samuel_apples_after_eating_half - apples_for_pie\n    result = samuel_apples_left\n    return result\n\n`Answer 1`: 12.0\n`Evaluation`: Wrong\n`Mistakes`: The mistake in the first attempt lies in the incorrect calculation of apples used for the pie. The attempt incorrectly divides the halved number of apples by 7 to determine the amount used for the pie when it should be calculated from the original total of apples Samuel bought.\n`Hint for a better Method choice`: A plan-and-then-code approach would ensure a clear step-by-step process is outlined before coding, which is crucial in a problem with multiple sequential calculations to avoid logical errors.\n`Workaround Method`: Plan-and-then-Code (p2c)\n`Attempt 2`: 1. Define a variable called `bonnie_apples` to represent the number of apples bought by Bonnie and assign it the value 8.\n2. Define a variable called `samuel_extra_apples` to represent the additional number of apples bought by Samuel compared to Bonnie and assign it the value 20.\n3. Calculate the total number of apples bought by Samuel by adding `bonnie_apples` and `samuel_extra_apples`. Let's call this `samuel_apples`.\n4. Define a variable `samuel_ate` to represent the number of apples Samuel ate, which is half of his apples, so `samuel_ate = samuel_apples / 2`.\n5. Define another variable `samuel_pie` to represent the number of apples used to make the pie, which is 1/7th of his apples, so `samuel_pie = samuel_apples / 7`.\n6. Calculate the number of apples Samuel has left by subtracting `samuel_ate` and `samuel_pie` from his total number of apples (`samuel_apples - samuel_ate - samuel_pie`).\n7. Return the final number of apples Samuel has.\ndef solution():\n    bonnie_apples = 8\n    samuel_extra_apples = 20\n\n    samuel_apples = bonnie_apples + samuel_extra_apples\n    samuel_ate = samuel_apples / 2\n    samuel_pie = samuel_apples / 7\n\n    samuel_left = samuel_apples - samuel_ate - samuel_pie\n\n    return samuel_left\n\n`Answer 2`: 10.0\n`Evaluation`: Correct\n\n\n\nNow, try the `Question` below following the same procedure as above. Try the question with the choice of your `Method`, and evaluate the `Answer`. If your `Attempt` is considered wrong, identify the `Mistakes` and reason to take `Workaround Method` by writing `Hint for a better Method choice`. Based on it, make a correct reattempt.\n\n`Question`: What is the least common multiple of the first ten positive integers?"
      }
    ]
  },
  "majority_ans": null,
  "prompt_file": "prompt_construction_src/prep_rims_prompts/gsm_prompts/3_reflectonce_cot2p2c.pal2cot.pal2p2c.txt_rm_ans",
  "inference_mode": "rims"
}

2) rims Key: 'index' error

gsm, svamp의 경우 row 별 'index' 필드가 존재합니다. 이를 후처리에 사용중인 것 때문에 생기는 문제라 생각됩니다. 수정 후에 반영하겠습니다.

fgenie commented 8 months ago

@jason9693 추가로 OCWcourses의 경우는 evaluation에 있어서 논문에 아예 함수를 제공했기 때문에 이것을 따로 활용하도록 반영해두겠습니다. Tex normalization은 같지만 eval threshold 조건이 조금 다릅니다.

지금은 OCW, MATH가 같은 방식으로 처리되도록 되어있으니 여기까지 반영되기 전에는 eval 결과를 제대로 보기 어렵습니다. Minerva pp. 26-27

seanexp commented 8 months ago
>>> is_equiv(normalize_final_answer(a1), normalize_final_answer(0.5)) 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/seonils/dev/rims_minimal/src/utils/math_util.py", line 138, in normalize_final_answer
    final_answer = final_answer.split("=")[-1]
                   ^^^^^^^^^^^^^^^^^^
AttributeError: 'float' object has no attribute 'split'

이 부분은 원래 에러가 raise 되어야 하는 부분이었던거 같은데 너무 예외처리가 많아서 못 잡은거 같네요

fgenie commented 8 months ago

6704ca14eee6f3a2faf23ce039bd064c72431e36 d846488277ca25067b700eb2ca1bce906833bf05 a2ad4b6abc43b28b516f91db0b50a843ef411ba4 82ecf3840dc9e8ecb1d6df94ec908198aed8c59e

으로 해결됨

fgenie commented 8 months ago

https://github.com/fgenie/rims_minimal/issues/18#issuecomment-1906423780

무결성 교차검증 완료 chatgpt on math