fgenie / rims_minimal

Starting is half the work, and finishing is the other half.

Collecting the final-final-final results #35

Open fgenie opened 7 months ago

fgenie commented 7 months ago

Greedy decoding results (T=0, seed=777, chatgpt)

(Selected) v3 prompt = src/prompt_construction_src/prep_rims_prompts/gsm_prompts/ablation/4_ablate_3_reflectonce_p2c2cot.pal2p2c.pal2cot.txt_rm_ans

Changes

chatgpt (gpt-3.5-turbo-16k-0613) results by prompt

Results TLDR

Analysis

Items in (parentheses) are points I am still unsure about

from papers

performance in total

v3

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:--------------------|:---------------------|:------------------|:-----------------------|
| baseline | 876/1000 (87.60%) | 1087/1319 (82.41%) | 36/272 (13.24%) | 1831/4996 (36.65%) |
| rims | 886/1000 (88.60%) | 1114/1319 (84.46%) | 39/272 (14.34%) | 1932/4996 (38.67%) |
| rims_turn | 881/1000 (88.10%) | 1108/1319 (84.00%) | 35/272 (12.87%) | nan |
| rims_abl | 877/1000 (87.70%) | 1083/1319 (82.11%) | 33/272 (12.13%) | 1769/4996 (35.41%) |
| rims_abl_turn | 860/1000 (86.00%) | 1052/1319 (79.76%) | 33/272 (12.13%) | nan |

Other prompts

## v2

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:--------------------|:---------------------|:------------------|:-----------------------|
| baseline | 876/1000 (87.60%) | 1087/1319 (82.41%) | 36/272 (13.24%) | 1831/4996 (36.65%) |
| rims | 884/1000 (88.40%) | 1106/1319 (83.85%) | 39/272 (14.34%) | 1875/4996 (37.53%) |
| rims_turn | 880/1000 (88.00%) | 1091/1319 (82.71%) | 37/272 (13.60%) | nan |
| rims_abl | 884/1000 (88.40%) | 1070/1319 (81.12%) | 42/272 (15.44%) | 1861/4996 (37.25%) |
| rims_abl_turn | 882/1000 (88.20%) | 1106/1319 (83.85%) | 45/272 (16.54%) | nan |

## v1

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:--------------------|:---------------------|:------------------|:-----------------------|
| baseline | 876/1000 (87.60%) | 1087/1319 (82.41%) | 36/272 (13.24%) | 1831/4996 (36.65%) |
| rims | 880/1000 (88.00%) | 1101/1319 (83.47%) | 39/272 (14.34%) | 1954/4996 (39.11%) |
| rims_turn | 888/1000 (88.80%) | 1119/1319 (84.84%) | 46/272 (16.91%) | nan |
| rims_abl | 867/1000 (86.70%) | 1070/1319 (81.12%) | 35/272 (12.87%) | 1820/4996 (36.43%) |
| rims_abl_turn | 877/1000 (87.70%) | 1089/1319 (82.56%) | 36/272 (13.24%) | nan |
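
To make the v3 comparison easier to read, here is a minimal sketch that turns the raw counts into percentage-point gains of rims over baseline. The counts are copied from the v3 rows above; the names `V3_COUNTS` and `gain_pp` are illustrative and not part of the repository.

```python
# Correct counts copied from the v3 table: dataset -> (baseline, rims, total).
V3_COUNTS = {
    "svamp": (876, 886, 1000),
    "gsm": (1087, 1114, 1319),
    "ocw": (36, 39, 272),
    "math": (1831, 1932, 4996),
}


def gain_pp(baseline_correct: int, rims_correct: int, total: int) -> float:
    """Accuracy gain of rims over baseline, in percentage points."""
    return round(100.0 * (rims_correct - baseline_correct) / total, 2)


gains = {name: gain_pp(*counts) for name, counts in V3_COUNTS.items()}

for name, value in gains.items():
    print(f"{name}: +{value:.2f} pp")
```

With the v3 prompt, rims is ahead of baseline on all four datasets, by roughly one to two percentage points.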

Model Select Success Rate

= conflict only results

v3

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:-------------------|:------------------|:-----------------|:-----------------------|
| baseline | 17/61 (27.87%) | 45/171 (26.32%) | 6/162 (3.70%) | 202/2260 (8.94%) |
| rims | 27/61 (44.26%) | 72/171 (42.11%) | 9/162 (5.56%) | 303/2259 (13.41%) |
| rims_turn | 22/61 (36.07%) | 66/171 (38.60%) | 5/162 (3.09%) | nan |
| rims_abl | 18/61 (29.51%) | 41/171 (23.98%) | 3/162 (1.85%) | 140/2259 (6.20%) |
| rims_abl_turn | 1/61 (1.64%) | 10/171 (5.85%) | 3/162 (1.85%) | nan |

Other prompts

## v2

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:-------------------|:------------------|:-----------------|:-----------------------|
| baseline | 17/61 (27.87%) | 45/171 (26.32%) | 6/162 (3.70%) | 202/2260 (8.94%) |
| rims | 25/61 (40.98%) | 64/171 (37.43%) | 9/162 (5.56%) | 246/2259 (10.89%) |
| rims_turn | 21/61 (34.43%) | 49/171 (28.65%) | 7/162 (4.32%) | nan |
| rims_abl | 25/61 (40.98%) | 28/171 (16.37%) | 12/162 (7.41%) | 232/2259 (10.27%) |
| rims_abl_turn | 23/61 (37.70%) | 64/171 (37.43%) | 15/162 (9.26%) | nan |

## v1

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:-------------------|:------------------|:-----------------|:-----------------------|
| baseline | 17/61 (27.87%) | 45/171 (26.32%) | 6/162 (3.70%) | 202/2260 (8.94%) |
| rims | 21/61 (34.43%) | 59/171 (34.50%) | 9/162 (5.56%) | 325/2259 (14.39%) |
| rims_turn | 29/61 (47.54%) | 77/171 (45.03%) | 16/162 (9.88%) | nan |
| rims_abl | 8/61 (13.11%) | 28/171 (16.37%) | 5/162 (3.09%) | 191/2259 (8.46%) |
| rims_abl_turn | 18/61 (29.51%) | 47/171 (27.49%) | 6/162 (3.70%) | nan |
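
A quick consistency check on the two v3 tables, as a sketch: if "conflict only" means the subset of questions where a selection actually has to be made (my reading, not stated in the issue), then total correct minus conflict-only correct should be the same for every configuration. The SVAMP counts below are copied from the v3 tables; `SVAMP_V3` and `non_conflict_correct` are illustrative names.

```python
# SVAMP v3 counts copied from the tables: config -> (total correct, conflict-only correct).
SVAMP_V3 = {
    "baseline": (876, 17),
    "rims": (886, 27),
    "rims_turn": (881, 22),
    "rims_abl": (877, 18),
    "rims_abl_turn": (860, 1),
}


def non_conflict_correct(rows: dict) -> dict:
    """Correct answers that do not come from the conflict-only subset."""
    return {name: total - conflict for name, (total, conflict) in rows.items()}


remainder = non_conflict_correct(SVAMP_V3)
print(remainder)

# Every configuration should share the same non-conflict count.
print(len(set(remainder.values())) == 1)
```

Every configuration leaves the same remainder on SVAMP (859 of the 939 non-conflict questions), so the differences in the total table come entirely from the 61 conflict cases. The same subtraction gives a constant for the GSM, OCW and MATH columns as well.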

parsing failure count (llm fails)

v3

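| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:-------------------|:-----------------|:-----------------|:-----------------------|
| baseline | 0/1000 | 0/1319 | 0/272 | 1/4996 |
| rims | 1/1000 | 0/1319 | 2/272 | 53/4996 |
| rims_turn | 6/1000 | 11/1319 | 49/272 | nan |
| rims_abl | 1/1000 | 0/1319 | 6/272 | 39/4996 |
| rims_abl_turn | 3/1000 | 3/1319 | 12/272 | nan |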

Other prompts

## v2

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:-------------------|:-----------------|:-----------------|:-----------------------|
| baseline | 0/1000 | 0/1319 | 0/272 | 1/4996 |
| rims | 0/1000 | 0/1319 | 5/272 | 20/4996 |
| rims_turn | 1/1000 | 1/1319 | 27/272 | nan |
| rims_abl | 2/1000 | 0/1319 | 11/272 | 109/4996 |
| rims_abl_turn | 5/1000 | 5/1319 | 48/272 | nan |

## v1

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:-------------------|:-----------------|:-----------------|:-----------------------|
| baseline | 0/1000 | 0/1319 | 0/272 | 1/4996 |
| rims | 0/1000 | 0/1319 | 0/272 | 18/4996 |
| rims_turn | 0/1000 | 2/1319 | 21/272 | nan |
| rims_abl | 0/1000 | 0/1319 | 10/272 | 33/4996 |
| rims_abl_turn | 2/1000 | 2/1319 | 17/272 | nan |

fgenie commented 7 months ago

single method results

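| | SVAMP | GSM | OCW | MATH |
|:----|:------------------|:-------------------|:---------------|:--------------------|
| cot | 834/1000 (83.4%) | 1030/1319 (78.1%) | 27/272 (9.9%) | 1232/4996 (24.7%) |
| pal | 841/1000 (84.1%) | 1054/1319 (79.9%) | 40/272 (14.7%) | 1704/4996 (34.1%) |
| p2c | 833/1000 (83.3%) | 988/1319 (74.9%) | 33/272 (12.1%) | 1726/4996 (34.5%) |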

is_equiv_ocw fails to parse and check the answers given in the provided prompt. The cause is flawed parsing and equivalence logic (surprisingly, this comes from the original authors' code).

OCW results re-measured after changing is_equiv_ocw from normalize_final_answer + is_equiv_tex to normalize_symbolic_exp + is_equiv_exp:

| | OCW |
|:----|:----------------|
| cot | 27/272 (9.9%) |
| pal | 38/272 (14.0%) |
| p2c | 36/272 (13.2%) |

The results are mostly unchanged. I should still check each configuration individually.

fgenie commented 7 months ago

GPT4TURBO

total performance

v3

| | /ocw_gpt4turbo/ | /svamp_gpt4turbo/ | /gsm_gpt4turbo/ |
|:---------|:-----------------|:-------------------|:--------------------|
| baseline | 60/272 (22.06%) | 953/1000 (95.30%) | 1256/1319 (95.22%) |
| rims | 63/272 (23.16%) | 954/1000 (95.40%) | 1263/1319 (95.75%) |
| rims_abl | 61/272 (22.43%) | 954/1000 (95.40%) | 1258/1319 (95.38%) |

model selection success rate

v3

| | /ocw_gpt4turbo/ | /svamp_gpt4turbo/ | /gsm_gpt4turbo/ |
|:---------|:-----------------|:-------------------|:-----------------|
| baseline | 7/128 (5.47%) | 1/14 (7.14%) | 7/22 (31.82%) |
| rims | 10/128 (7.81%) | 2/14 (14.29%) | 14/22 (63.64%) |
| rims_abl | 8/128 (6.25%) | 2/14 (14.29%) | 9/22 (40.91%) |

failure rate

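v3

| | /ocw_gpt4turbo/ | /svamp_gpt4turbo/ | /gsm_gpt4turbo/ |
|:---------|:-----------------|:-------------------|:-----------------|
| baseline | 20/272 | 5/1000 | 0/1319 |
| rims | 11/272 | 0/1000 | 0/1319 |
| rims_abl | 23/272 | 0/1000 | 0/1319 |
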
fgenie commented 7 months ago

GPT4TURBO single method results

TLDR any( single < model selection < rims )

| | SVAMP | GSM | OCW | MATH |
|:----|:------------------|:-------------------|:----|:-----|
| cot | 919/1000 (91.9%) | 1210/1319 (91.7%) | TBA | TBA |
| pal | 944/1000 (94.4%) | 1238/1319 (93.9%) | TBA | TBA |
| p2c | 948/1000 (94.8%) | 1238/1319 (93.9%) | TBA | TBA |
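
The TLDR ordering can be checked directly against the GPT4TURBO counts reported in this thread. The sketch below assumes that the "baseline" row of the total-performance table is the plain model-selection setup and that "single" means the best of cot, pal and p2c. Both are my reading of the TLDR, and the variable names are illustrative. OCW and MATH are left out because the single-method numbers are still TBA.

```python
# GPT4TURBO correct counts copied from the tables in this thread.
SINGLE = {
    "svamp": {"cot": 919, "pal": 944, "p2c": 948},
    "gsm": {"cot": 1210, "pal": 1238, "p2c": 1238},
}
MODEL_SELECTION = {"svamp": 953, "gsm": 1256}  # "baseline" row (assumed meaning)
RIMS = {"svamp": 954, "gsm": 1263}


def ordering_holds(dataset: str) -> bool:
    """True if best single method < model selection < rims on this dataset."""
    best_single = max(SINGLE[dataset].values())
    return best_single < MODEL_SELECTION[dataset] < RIMS[dataset]


for dataset in SINGLE:
    print(dataset, ordering_holds(dataset))
```

Under these assumptions the ordering holds on both SVAMP and GSM, although the gap between model selection and rims on SVAMP is a single question.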