**Open** · fgenie opened this issue 7 months ago
| | SVAMP | GSM | OCW | MATH |
|---|---|---|---|---|
| cot | 834/1000 (83.4%) | 1030/1319 (78.1%) | 27/272 (9.9%) | 1232/4996 (24.7%) |
| pal | 841/1000 (84.1%) | 1054/1319 (79.9%) | 40/272 (14.7%) | 1704/4996 (34.1%) |
| p2c | 833/1000 (83.3%) | 988/1319 (74.9%) | 33/272 (12.1%) | 1726/4996 (34.5%) |
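For reference, each cell in these tables is `correct/total (accuracy%)`. A minimal formatter that reproduces the cell format (a sketch, not code from the repo):

```python
def acc_cell(correct: int, total: int) -> str:
    """Format an accuracy cell as 'correct/total (pct%)', one decimal place."""
    return f"{correct}/{total} ({100 * correct / total:.1f}%)"

print(acc_cell(834, 1000))  # -> 834/1000 (83.4%)
```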
`is_equiv_ocw` cannot parse and check the answers against the provided prompt's answers. Why? Flawed parsing and equivalence logic (surprisingly, in the author's own code).

OCW results re-measured with `is_equiv_ocw` modified from `normalize_final_answer` + `is_equiv_tex` to `normalize_symbolic_exp` + `is_equiv_exp` (same cot/pal/p2c order as above):

| | OCW (re-measured) |
|---|---|
| cot | 27/272 (9.9%) |
| pal | 38/272 (14.0%) |
| p2c | 36/272 (13.2%) |

Mostly the same results... I should check each configuration.
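To illustrate what the modified check does, here is a heavily simplified sketch. The two functions below are toy stand-ins for the repo's `normalize_symbolic_exp` / `is_equiv_exp` (the real ones do far more symbolic work); only the normalize-then-compare flow is the point:

```python
import re

def normalize_symbolic_exp(ans: str) -> str:
    """Toy normalizer: strip LaTeX wrappers, units, and whitespace."""
    ans = ans.strip()
    ans = re.sub(r"\\text\{.*?\}", "", ans)                       # drop \text{...} units
    ans = ans.replace("\\left", "").replace("\\right", "")
    ans = ans.replace("$", "").replace(" ", "")
    ans = re.sub(r"\\frac\{(.+?)\}\{(.+?)\}", r"(\1)/(\2)", ans)  # \frac{a}{b} -> (a)/(b)
    return ans

def is_equiv_exp(a: str, b: str) -> bool:
    """String match after normalization, with a numeric fallback."""
    na, nb = normalize_symbolic_exp(a), normalize_symbolic_exp(b)
    if na == nb:
        return True
    try:
        # eval is only tolerable here because this is an offline sketch
        return abs(eval(na) - eval(nb)) < 1e-6
    except Exception:
        return False
```

With this flow, `is_equiv_exp("\\frac{1}{2}", "0.5")` passes while a raw string comparison of the final answers would not, which is the class of failure the original `normalize_final_answer` + `is_equiv_tex` path hit on OCW.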
| v3 | /ocw_gpt4turbo/ | /svamp_gpt4turbo/ | /gsm_gpt4turbo/ |
|---|---|---|---|
| baseline | 60/272 (22.06 \%) | 953/1000 (95.30 \%) | 1256/1319 (95.22 \%) |
| rims | 63/272 (23.16 \%) | 954/1000 (95.40 \%) | 1263/1319 (95.75 \%) |
| rims_abl | 61/272 (22.43 \%) | 954/1000 (95.40 \%) | 1258/1319 (95.38 \%) |
v3 (conflict-only cases)

| | /ocw_gpt4turbo/ | /svamp_gpt4turbo/ | /gsm_gpt4turbo/ |
|---|---|---|---|
| baseline | 7/128 (5.47 \%) | 1/14 (7.14 \%) | 7/22 (31.82 \%) |
| rims | 10/128 (7.81 \%) | 2/14 (14.29 \%) | 14/22 (63.64 \%) |
| rims_abl | 8/128 (6.25 \%) | 2/14 (14.29 \%) | 9/22 (40.91 \%) |
| v3 (parsing failure count) | /ocw_gpt4turbo/ | /svamp_gpt4turbo/ | /gsm_gpt4turbo/ |
|---|---|---|---|
| baseline | 20/272 | 5/1000 | 0/1319 |
| rims | 11/272 | 0/1000 | 0/1319 |
| rims_abl | 23/272 | 0/1000 | 0/1319 |
TL;DR: any single method < model selection < rims
| | SVAMP | GSM | OCW | MATH |
|---|---|---|---|---|
| cot | 919/1000 (91.9%) | 1210/1319 (91.7%) | TBA | TBA |
| pal | 944/1000 (94.4%) | 1238/1319 (93.9%) | TBA | TBA |
| p2c | 948/1000 (94.8%) | 1238/1319 (93.9%) | TBA | TBA |
Greedy decoding results (T=0, seed=777, ChatGPT)

(Selected) v3 prompt =
src/prompt_construction_src/prep_rims_prompts/gsm_prompts/ablation/4_ablate_3_reflectonce_p2c2cot.pal2p2c.pal2cot.txt_rm_ans

Changes

Per-prompt ChatGPT (gpt-3.5-turbo-16k-0613) results

Results TL;DR

Analysis

Items in (parentheses) are still under consideration
from papers
performance in total
v3
Other prompts
## v2

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:--------------------|:---------------------|:------------------|:-----------------------|
| baseline | 876/1000 (87.60 \%) | 1087/1319 (82.41 \%) | 36/272 (13.24 \%) | 1831/4996 (36.65 \%) |
| rims | 884/1000 (88.40 \%) | 1106/1319 (83.85 \%) | 39/272 (14.34 \%) | 1875/4996 (37.53 \%) |
| rims_turn | 880/1000 (88.00 \%) | 1091/1319 (82.71 \%) | 37/272 (13.60 \%) | nan |
| rims_abl | 884/1000 (88.40 \%) | 1070/1319 (81.12 \%) | 42/272 (15.44 \%) | 1861/4996 (37.25 \%) |
| rims_abl_turn | 882/1000 (88.20 \%) | 1106/1319 (83.85 \%) | 45/272 (16.54 \%) | nan |

## v1

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:--------------------|:---------------------|:------------------|:-----------------------|
| baseline | 876/1000 (87.60 \%) | 1087/1319 (82.41 \%) | 36/272 (13.24 \%) | 1831/4996 (36.65 \%) |
| rims | 880/1000 (88.00 \%) | 1101/1319 (83.47 \%) | 39/272 (14.34 \%) | 1954/4996 (39.11 \%) |
| rims_turn | 888/1000 (88.80 \%) | 1119/1319 (84.84 \%) | 46/272 (16.91 \%) | nan |
| rims_abl | 867/1000 (86.70 \%) | 1070/1319 (81.12 \%) | 35/272 (12.87 \%) | 1820/4996 (36.43 \%) |
| rims_abl_turn | 877/1000 (87.70 \%) | 1089/1319 (82.56 \%) | 36/272 (13.24 \%) | nan |

Model Select Success Rate = conflict-only results
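"Conflict-only" denominators count only the questions where the base methods disagree, so selection actually matters. A sketch of that accounting (the record field names here are assumptions, not the repo's schema):

```python
def conflict_only_success(records):
    """Among questions where cot/pal/p2c disagree, count how often the
    finally selected answer matches gold. Returns (hits, conflicts)."""
    hits, conflicts = 0, 0
    for r in records:
        if len({r["cot"], r["pal"], r["p2c"]}) > 1:  # methods disagree
            conflicts += 1
            hits += int(r["selected"] == r["gold"])
    return hits, conflicts
```

Unanimous questions never enter the denominator, which is why the conflict-only totals (e.g. 61, 171, 162) are much smaller than the dataset sizes.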
v3
Other prompts
## v2

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:-------------------|:------------------|:-----------------|:-----------------------|
| baseline | 17/61 (27.87 \%) | 45/171 (26.32 \%) | 6/162 (3.70 \%) | 202/2260 (8.94 \%) |
| rims | 25/61 (40.98 \%) | 64/171 (37.43 \%) | 9/162 (5.56 \%) | 246/2259 (10.89 \%) |
| rims_turn | 21/61 (34.43 \%) | 49/171 (28.65 \%) | 7/162 (4.32 \%) | nan |
| rims_abl | 25/61 (40.98 \%) | 28/171 (16.37 \%) | 12/162 (7.41 \%) | 232/2259 (10.27 \%) |
| rims_abl_turn | 23/61 (37.70 \%) | 64/171 (37.43 \%) | 15/162 (9.26 \%) | nan |

## v1

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:-------------------|:------------------|:-----------------|:-----------------------|
| baseline | 17/61 (27.87 \%) | 45/171 (26.32 \%) | 6/162 (3.70 \%) | 202/2260 (8.94 \%) |
| rims | 21/61 (34.43 \%) | 59/171 (34.50 \%) | 9/162 (5.56 \%) | 325/2259 (14.39 \%) |
| rims_turn | 29/61 (47.54 \%) | 77/171 (45.03 \%) | 16/162 (9.88 \%) | nan |
| rims_abl | 8/61 (13.11 \%) | 28/171 (16.37 \%) | 5/162 (3.09 \%) | 191/2259 (8.46 \%) |
| rims_abl_turn | 18/61 (29.51 \%) | 47/171 (27.49 \%) | 6/162 (3.70 \%) | nan |

Parsing failure count (LLM fails)
Applying `turn`, or removing the reflection step from the blurb as an ablation, increases parsing failures (the LLM fails to generate properly).

v3
Other prompts
## v2

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:-------------------|:-----------------|:-----------------|:-----------------------|
| baseline | 0/1000 | 0/1319 | 0/272 | 1/4996 |
| rims | 0/1000 | 0/1319 | 5/272 | 20/4996 |
| rims_turn | 1/1000 | 1/1319 | 27/272 | nan |
| rims_abl | 2/1000 | 0/1319 | 11/272 | 109/4996 |
| rims_abl_turn | 5/1000 | 5/1319 | 48/272 | nan |

## v1

| | /svamp_0613long/ | /gsm_0613long/ | /ocw_0613long/ | /math_full_0613long/ |
|:--------------|:-------------------|:-----------------|:-----------------|:-----------------------|
| baseline | 0/1000 | 0/1319 | 0/272 | 1/4996 |
| rims | 0/1000 | 0/1319 | 0/272 | 18/4996 |
| rims_turn | 0/1000 | 2/1319 | 21/272 | nan |
| rims_abl | 0/1000 | 0/1319 | 10/272 | 33/4996 |
| rims_abl_turn | 2/1000 | 2/1319 | 17/272 | nan |
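How a parsing-failure cell could be counted, as a sketch; the `Answer:` extraction pattern below is an assumption for illustration, not the repo's actual parser:

```python
import re

# A generation counts as a parsing failure when no final-answer line
# can be extracted from it (pattern assumed for this sketch).
ANSWER_PAT = re.compile(r"(?im)^answer\s*[:=]\s*\S+")

def count_parsing_failures(generations):
    """Return how many generations yield no extractable final answer."""
    return sum(1 for g in generations if ANSWER_PAT.search(g) is None)
```

Under this definition, malformed or truncated generations (more frequent with `turn` or with reflection ablated, per the note above) inflate the numerators in the tables.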