Open fgenie opened 5 months ago
MATH-full results are incomplete for now: only 357 of 5000 rows have outputs. Strikethrough = worse than the model-selection baseline. All runs use greedy decoding (T=0, seed=777).
method | GSM | SVAMP | OCW |
---|---|---|---|
cot | 1034/1319 (78.4%) | 815/1000 (81.5%) | 28/272 (10.3%) |
pal | 1027/1319 (77.9%) | 844/1000 (84.4%) | 30/272 (11.0%) |
p2c | 1034/1319 (78.4%) | 826/1000 (82.6%) | 50/272 (18.4%) |
method | /svamp/ | /gsm/ | /math_full/ | /ocw/ |
---|---|---|---|---|
baseline | 879/1000 (87.90 \%) | 1102/1319 (83.55 \%) | 194/357 (54.34 \%) | 47/272 (17.28 \%) |
rims | | 1121/1319 (84.99 \%) | 198/357 (55.46 \%) | 51/272 (18.75 \%) |
rims_turn | 883/1000 (88.30 \%) | 1122/1319 (85.06 \%) | 198/357 (55.46 \%) | 47/272 (17.28 \%) |
rims_abl | 873/1000 (87.30 \%) | 1088/1319 (82.49 \%) | 189/357 (52.94 \%) | 46/272 (16.91 \%) |
rims_abl_turn | 853/1000 (85.30 \%) | 1061/1319 (80.44 \%) | 173/357 (48.46 \%) | 42/272 (15.44 \%) |
method | /svamp/ | /gsm/ | /math_full/ | /ocw/ |
---|---|---|---|---|
baseline | 879/1000 (87.90 \%) | 1102/1319 (83.55 \%) | 194/357 (54.34 \%) | 47/272 (17.28 \%) |
rims | 879/1000 (87.90 \%) | 1117/1319 (84.69 \%) | 199/357 (55.74 \%) | 51/272 (18.75 \%) |
rims_turn | | 1110/1319 (84.15 \%) | | |
rims_abl | 871/1000 (87.10 \%) | 1095/1319 (83.02 \%) | 190/357 (53.22 \%) | 47/272 (17.28 \%) |
rims_abl_turn | 853/1000 (85.30 \%) | 1062/1319 (80.52 \%) | 175/357 (49.02 \%) | 44/272 (16.18 \%) |
method | /svamp/ | /gsm/ | /math_full/ | /ocw/ |
---|---|---|---|---|
baseline | 879/1000 (87.90 \%) | 1102/1319 (83.55 \%) | 194/357 (54.34 \%) | 47/272 (17.28 \%) |
rims | | 1122/1319 (85.06 \%) | 203/357 (56.86 \%) | 53/272 (19.49 \%) |
rims_turn | 884/1000 (88.40 \%) | 1124/1319 (85.22 \%) | | 54/272 (19.85 \%) |
rims_abl | 877/1000 (87.70 \%) | 1105/1319 (83.78 \%) | 199/357 (55.74 \%) | 47/272 (17.28 \%) |
rims_abl_turn | 854/1000 (85.40 \%) | 1058/1319 (80.21 \%) | 173/357 (48.46 \%) | 42/272 (15.44 \%) |
method | /svamp/ | /gsm/ | /math_full/ | /ocw/ |
---|---|---|---|---|
baseline | 26/65 (40.00 \%) | 48/168 (28.57 \%) | 22/146 (15.07 \%) | 5/124 (4.03 \%) |
rims | | 67/168 (39.88 \%) | 26/146 (17.81 \%) | 9/124 (7.26 \%) |
rims_turn | 30/65 (46.15 \%) | 68/168 (40.48 \%) | 26/146 (17.81 \%) | |
rims_abl | 20/65 (30.77 \%) | 34/168 (20.24 \%) | 17/146 (11.64 \%) | 4/124 (3.23 \%) |
rims_abl_turn | 0/65 (0.00 \%) | 7/168 (4.17 \%) | 1/146 (0.68 \%) | 0/124 (0.00 \%) |
method | /svamp/ | /gsm/ | /math_full/ | /ocw/ |
---|---|---|---|---|
baseline | 26/65 (40.00 \%) | 48/168 (28.57 \%) | 22/146 (15.07 \%) | 5/124 (4.03 \%) |
rims | 26/65 (40.00 \%) | 63/168 (37.50 \%) | 27/146 (18.49 \%) | 9/124 (7.26 \%) |
rims_turn | | 56/168 (33.33 \%) | | |
rims_abl | 18/65 (27.69 \%) | 41/168 (24.40 \%) | 18/146 (12.33 \%) | 5/124 (4.03 \%) |
rims_abl_turn | 0/65 (0.00 \%) | 8/168 (4.76 \%) | 3/146 (2.05 \%) | 2/124 (1.61 \%) |
method | /svamp/ | /gsm/ | /math_full/ | /ocw/ |
---|---|---|---|---|
baseline | 26/65 (40.00 \%) | 48/168 (28.57 \%) | 22/146 (15.07 \%) | 5/124 (4.03 \%) |
rims | | 68/168 (40.48 \%) | 31/146 (21.23 \%) | 11/124 (8.87 \%) |
rims_turn | 31/65 (47.69 \%) | 70/168 (41.67 \%) | | 12/124 (9.68 \%) |
rims_abl | 24/65 (36.92 \%) | 51/168 (30.36 \%) | 27/146 (18.49 \%) | 5/124 (4.03 \%) |
rims_abl_turn | 1/65 (1.54 \%) | 4/168 (2.38 \%) | 1/146 (0.68 \%) | 0/124 (0.00 \%) |
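
For anyone re-deriving the cells above, here is a minimal sketch of how a `correct/total (xx.xx \%)` entry and the "worse than baseline" strikethrough could be produced. The function names `format_cell` and `mark_if_worse` are hypothetical and not taken from the evaluation scripts; the `~~...~~` markdown strikethrough and the strict `<` comparison are also assumptions about how the marking was applied.

```python
def format_cell(correct: int, total: int) -> str:
    """Render one result cell as 'correct/total (pct)' with two decimals,
    mirroring the style of the tables above."""
    pct = 100 * correct / total
    return f"{correct}/{total} ({pct:.2f} \\%)"


def mark_if_worse(correct: int, baseline_correct: int, total: int) -> str:
    """Strike the cell through when it has fewer correct answers than the
    baseline on the same dataset (assumed rule: strictly fewer)."""
    cell = format_cell(correct, total)
    return f"~~{cell}~~" if correct < baseline_correct else cell


if __name__ == "__main__":
    # math_full column, first baseline table: baseline 194/357, rims_abl 189/357
    print(format_cell(194, 357))
    print(mark_if_worse(189, 194, 357))
```

Comparing raw counts rather than rounded percentages avoids ties that only appear after rounding; this works here because each column shares one denominator across methods.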
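DEPRECATED: rims uses the 16k-0613 chatgpt model while the rest use 0613 chatgpt, which is likely to be picked on. Results with everything unified to 16k are attached at the very bottom.
Option applied to all experiments below: greedy decoding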
Performance in total
Only one method
Conflict only -- choice success rate
ablation (rims - reflection)
TBA
MATH, OCW type-wise
TBA