fgenie / rims_minimal

Been lazy enough to pull over again to the end!

[GSM results log] Re-measured because a lot of debugging was done during the prealgebra experiments #22

Closed: fgenie closed this issue 4 months ago

fgenie commented 5 months ago

chatgpt

baseline

total: 1099 / 1319
fail: 0 / 1319
nonconflict: 1059 / 1160
conflict: 40 / 159

83.32%

rims

total: 1122 / 1319
fail: 0 / 1319
nonconflict: 1059 / 1160
conflict: 63 / 159

85.06%

gpt4turbo

baseline

total: 1256 / 1319
fail: 0 / 1319
nonconflict: 1249 / 1297
conflict: 7 / 22

95.22%

rims

total: 1259 / 1319
fail: 0 / 1319
nonconflict: 1249 / 1297
conflict: 10 / 22

95.45%
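For anyone re-checking the numbers above, here is a minimal sketch of the arithmetic. It assumes the nonconflict and conflict buckets partition the 1319 GSM questions, which the counts are consistent with. The `RESULTS` layout and the `summarize` helper are hypothetical names used for illustration, not code from the repository.

```python
# Counts copied from the results above as (correct, bucket size).
# The dictionary layout is an assumption made for this sketch.
RESULTS = {
    ("chatgpt", "baseline"): {"nonconflict": (1059, 1160), "conflict": (40, 159)},
    ("chatgpt", "rims"): {"nonconflict": (1059, 1160), "conflict": (63, 159)},
    ("gpt4turbo", "baseline"): {"nonconflict": (1249, 1297), "conflict": (7, 22)},
    ("gpt4turbo", "rims"): {"nonconflict": (1249, 1297), "conflict": (10, 22)},
}


def summarize(buckets):
    """Return (correct, total, accuracy in percent) summed over all buckets."""
    correct = sum(c for c, _ in buckets.values())
    total = sum(n for _, n in buckets.values())
    return correct, total, round(100 * correct / total, 2)


if __name__ == "__main__":
    for (model, method), buckets in RESULTS.items():
        correct, total, acc = summarize(buckets)
        print(f"{model:10s} {method:9s} {correct} / {total} ({acc}%)")
```

In both models the nonconflict counts are identical for baseline and rims, so the whole difference in the totals comes from the conflict bucket.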

fgenie commented 5 months ago

chatGPT RESULTS

GSM

cot

1037 / 1319 (78.6%)

pal

1061 / 1319 (80.4%)

p2c

1003 / 1319 (76.0%)

fgenie commented 5 months ago

GPT4TURBO RESULTS

GSM

cot

1210 / 1319 (91.7%)

pal

1238 / 1319 (93.9%)

p2c

1238 / 1319 (93.9%)
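The single-method percentages in the last two comments can be reproduced the same way. This is only a sketch: `CORRECT` and `accuracy_table` are hypothetical names, and the counts are copied from the comments above.

```python
TOTAL = 1319  # number of GSM questions reported in every line above

# Correct-answer counts per model and method, copied from the comments.
CORRECT = {
    "chatgpt": {"cot": 1037, "pal": 1061, "p2c": 1003},
    "gpt4turbo": {"cot": 1210, "pal": 1238, "p2c": 1238},
}


def accuracy_table(correct, total=TOTAL):
    """Convert correct-answer counts into percentages rounded to one decimal."""
    return {
        model: {method: round(100 * n / total, 1) for method, n in methods.items()}
        for model, methods in correct.items()
    }


if __name__ == "__main__":
    for model, row in accuracy_table(CORRECT).items():
        print(model, row)
```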