fgenie / rims_minimal


To appendix: [merged results table] #33

Open fgenie opened 5 months ago

fgenie commented 5 months ago

DEPRECATE: rims used the 16k-0613 ChatGPT while the rest used 0613 ChatGPT, so this is likely to draw criticism. Results with everything unified to 16k are attached at the very bottom.

Option applied to all experiments below: greedy decoding

Performance in total

| Performance | MATH-prealgebra | OCWcourse | GSM8K | SVAMP |
| --- | --- | --- | --- | --- |
| gpt-3.5-turbo-0613 baseline | 501/871 (57.52%) | 35/272 (12.87%) | 1099/1319 (83.32%) | 877/1000 (87.7%) |
| gpt-3.5-turbo-0613 rims | 540/871 (62.00%) | 44/272 (16.18%) | 1122/1319 (85.56%) | 883/1000 (88.3%) |
| gpt-4-1106-preview baseline | 619/871 (71.07%) | 60/272 (22.06%) | 1256/1319 (95.22%) | 953/1000 (95.3%) |
| gpt-4-1106-preview rims | 639/871 (73.36%) | 67/272 (24.63%) | 1259/1319 (95.45%) | 953/1000 (95.3%) |

Only one method

| Performance | MATH-prealgebra | OCWcourse | GSM8K | SVAMP |
| --- | --- | --- | --- | --- |
| gpt-3.5-turbo-0613 cot | 364/871 (41.8%) | 25/272 (9.2%) | 1037/1319 (78.6%) | 830/1000 (83.0%) |
| gpt-3.5-turbo-0613 pal | 470/871 (54.0%) | 42/272 (15.4%) | 1061/1319 (80.4%) | 841/1000 (84.1%) |
| gpt-3.5-turbo-0613 p2c | 457/871 (52.5%) | 33/272 (12.1%) | 1003/1319 (76.0%) | 835/1000 (83.5%) |
| gpt-4-1106-preview cot | 482/871 (55.3%) | 42/272 (15.4%) | 1210/1319 (91.7%) | 919/1000 (91.9%) |
| gpt-4-1106-preview pal | 577/871 (66.2%) | 37/272 (13.6%) | 1238/1319 (93.9%) | 944/1000 (94.4%) |
| gpt-4-1106-preview p2c | 589/871 (67.6%) | 80/272 (29.4%) | 1238/1319 (93.9%) | 948/1000 (94.8%) |

Conflict only -- choice success rate

| Success rate | MATH-prealgebra | OCWcourse | GSM8K | SVAMP |
| --- | --- | --- | --- | --- |
| gpt-3.5-turbo-0613 baseline | 29/230 (12.61%) | 5/170 (2.94%) | 40/159 (25.16%) | 18/62 (29.03%) |
| gpt-3.5-turbo-0613 rims | 68/229 (29.69%) | 14/170 (8.24%) | 63/159 (39.62%) | 24/62 (38.71%) |
| gpt-4-1106-preview baseline | 6/124 (4.84%) | 7/128 (5.47%) | 7/22 (31.82%) | 1/14 (7.14%) |
| gpt-4-1106-preview rims | 26/124 (20.96%) | 14/128 (10.94%) | 10/22 (45.45%) | 1/14 (7.14%) |
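The success-rate denominators above count only the conflict cases, i.e. problems where the candidate methods disagree, so the selection step actually matters. A minimal sketch of how such a rate could be tallied (the function and row fields are hypothetical, not the repo's actual code):

```python
def choice_success_rate(rows):
    """Count selector successes among conflict rows (methods disagree)."""
    conflicts = [r for r in rows if len(set(r["answers"])) > 1]
    successes = sum(r["chosen_correct"] for r in conflicts)
    return successes, len(conflicts)

rows = [
    {"answers": ["4", "4", "4"], "chosen_correct": True},   # no conflict: excluded
    {"answers": ["4", "5", "4"], "chosen_correct": True},   # conflict, correct pick
    {"answers": ["3", "5", "4"], "chosen_correct": False},  # conflict, wrong pick
]
s, n = choice_success_rate(rows)
print(f"{s}/{n} ({s / n:.2%})")  # 1/2 (50.00%)
```

This is why the denominators differ per dataset and per model: each model produces a different set of conflict cases.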

ablation (rims - reflection)

TBA

MATH, OCW type-wise

TBA

fgenie commented 5 months ago

chatgpt 1106 --> To Appendix

math-full results are incomplete for now (5,000 problems in total; only 357 rows of outputs so far). Strikethrough = worse than the model-selection baseline. All runs use greedy decoding (T=0, seed=777).
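For reference, "greedy decoding (T=0, seed=777)" would correspond to a chat-completion request along these lines (a minimal illustrative sketch, assuming an OpenAI-style API; the payload builder is not the repo's actual code):

```python
def greedy_payload(model: str, prompt: str) -> dict:
    """Build a request payload for deterministic (greedy) decoding."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # greedy decoding (T=0)
        "seed": 777,       # fixed seed, as stated above
    }

p = greedy_payload("gpt-3.5-turbo-1106", "1 + 1 = ?")
print(p["temperature"], p["seed"])
```

Even with T=0 and a fixed seed, provider-side nondeterminism can still cause small run-to-run differences, so exact counts may not reproduce perfectly.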

TL;DR

only one method

| | GSM | SVAMP | OCW |
| --- | --- | --- | --- |
| cot | 1034/1319 (78.4%) | 815/1000 (81.5%) | 28/272 (10.3%) |
| pal | 1027/1319 (77.9%) | 844/1000 (84.4%) | 30/272 (11.0%) |
| p2c | 1034/1319 (78.4%) | 826/1000 (82.6%) | 50/272 (18.4%) |

performance

prompt v1 (current)

| | svamp | gsm | math_full | ocw |
| --- | --- | --- | --- | --- |
| baseline | 879/1000 (87.90%) | 1102/1319 (83.55%) | 194/357 (54.34%) | 47/272 (17.28%) |
| rims | 876/1000 (87.60%) | 1121/1319 (84.99%) | 198/357 (55.46%) | 51/272 (18.75%) |
| rims_turn | 883/1000 (88.30%) | 1122/1319 (85.06%) | 198/357 (55.46%) | 47/272 (17.28%) |
| rims_abl | 873/1000 (87.30%) | 1088/1319 (82.49%) | 189/357 (52.94%) | 46/272 (16.91%) |
| rims_abl_turn | 853/1000 (85.30%) | 1061/1319 (80.44%) | 173/357 (48.46%) | 42/272 (15.44%) |

prompt v2

| | svamp | gsm | math_full | ocw |
| --- | --- | --- | --- | --- |
| baseline | 879/1000 (87.90%) | 1102/1319 (83.55%) | 194/357 (54.34%) | 47/272 (17.28%) |
| rims | 879/1000 (87.90%) | 1117/1319 (84.69%) | 199/357 (55.74%) | 51/272 (18.75%) |
| rims_turn | 873/1000 (87.30%) | 1110/1319 (84.15%) | 190/357 (53.22%) | 45/272 (16.54%) |
| rims_abl | 871/1000 (87.10%) | 1095/1319 (83.02%) | 190/357 (53.22%) | 47/272 (17.28%) |
| rims_abl_turn | 853/1000 (85.30%) | 1062/1319 (80.52%) | 175/357 (49.02%) | 44/272 (16.18%) |

prompt v3

| | svamp | gsm | math_full | ocw |
| --- | --- | --- | --- | --- |
| baseline | 879/1000 (87.90%) | 1102/1319 (83.55%) | 194/357 (54.34%) | 47/272 (17.28%) |
| rims | 877/1000 (87.70%) | 1122/1319 (85.06%) | 203/357 (56.86%) | 53/272 (19.49%) |
| rims_turn | 884/1000 (88.40%) | 1124/1319 (85.22%) | 189/357 (52.94%) | 54/272 (19.85%) |
| rims_abl | 877/1000 (87.70%) | 1105/1319 (83.78%) | 199/357 (55.74%) | 47/272 (17.28%) |
| rims_abl_turn | 854/1000 (85.40%) | 1058/1319 (80.21%) | 173/357 (48.46%) | 42/272 (15.44%) |

success rate

v1

| | svamp | gsm | math_full | ocw |
| --- | --- | --- | --- | --- |
| baseline | 26/65 (40.00%) | 48/168 (28.57%) | 22/146 (15.07%) | 5/124 (4.03%) |
| rims | 23/65 (35.38%) | 67/168 (39.88%) | 26/146 (17.81%) | 9/124 (7.26%) |
| rims_turn | 30/65 (46.15%) | 68/168 (40.48%) | 26/146 (17.81%) | 5/124 (4.03%) |
| rims_abl | 20/65 (30.77%) | 34/168 (20.24%) | 17/146 (11.64%) | 4/124 (3.23%) |
| rims_abl_turn | 0/65 (0.00%) | 7/168 (4.17%) | 1/146 (0.68%) | 0/124 (0.00%) |

v2

| | svamp | gsm | math_full | ocw |
| --- | --- | --- | --- | --- |
| baseline | 26/65 (40.00%) | 48/168 (28.57%) | 22/146 (15.07%) | 5/124 (4.03%) |
| rims | 26/65 (40.00%) | 63/168 (37.50%) | 27/146 (18.49%) | 9/124 (7.26%) |
| rims_turn | 20/65 (30.77%) | 56/168 (33.33%) | 18/146 (12.33%) | 3/124 (2.42%) |
| rims_abl | 18/65 (27.69%) | 41/168 (24.40%) | 18/146 (12.33%) | 5/124 (4.03%) |
| rims_abl_turn | 0/65 (0.00%) | 8/168 (4.76%) | 3/146 (2.05%) | 2/124 (1.61%) |

v3

| | svamp | gsm | math_full | ocw |
| --- | --- | --- | --- | --- |
| baseline | 26/65 (40.00%) | 48/168 (28.57%) | 22/146 (15.07%) | 5/124 (4.03%) |
| rims | 24/65 (36.92%) | 68/168 (40.48%) | 31/146 (21.23%) | 11/124 (8.87%) |
| rims_turn | 31/65 (47.69%) | 70/168 (41.67%) | 17/146 (11.64%) | 12/124 (9.68%) |
| rims_abl | 24/65 (36.92%) | 51/168 (30.36%) | 27/146 (18.49%) | 5/124 (4.03%) |
| rims_abl_turn | 1/65 (1.54%) | 4/168 (2.38%) | 1/146 (0.68%) | 0/124 (0.00%) |