fgenie / rims_minimal

Starting is half the work, and finishing is the other half.

[OCW results] #31

Closed fgenie closed 7 months ago

fgenie commented 8 months ago

The same GSM-derived prompt was used. prompt = prompt_construction_src/prep_rims_prompts/gsm_prompts/3_reflectonce_cot2p2c.pal2cot.pal2p2c.txt_rm_ans

chatgpt

baseline

total: 35 / 272 (12.87 \%) fail: 0 / 272 nonconflict: 30 / 102 conflict: 5 / 170 (2.94 \%)

rims

total: 44 / 272 (16.18 \%) fail: 3 / 272 nonconflict: 30 / 102 conflict: 14 / 167 (8.24 \%, 170 as divisor)
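
The "170 as divisor" note can be read as follows: the 3 failed generations are missing from the evaluated conflict count (167), but they are still counted as wrong when the rate is reported. Below is a minimal Python sketch of that arithmetic. It assumes all 3 failures fall in the conflict subset (the 170 - 167 gap suggests this), and `conflict_rate` is a hypothetical helper, not a function from this repo.

```python
def conflict_rate(correct: int, evaluated: int, failed: int) -> float:
    """Percent correct on the conflict subset, counting failures as wrong.

    Assumption: every failed item belongs to the conflict subset, so the
    full subset size is evaluated + failed.
    """
    divisor = evaluated + failed  # put failed items back into the denominator
    return round(100 * correct / divisor, 2)


# chatgpt, rims: 14 correct, 167 evaluated, 3 failures
print(conflict_rate(14, 167, 3))
# chatgpt, baseline: 5 correct, 170 evaluated, no failures
print(conflict_rate(5, 170, 0))
```

Keeping the divisor fixed at the full subset size makes the baseline and rims rates directly comparable, since both are then measured over the same 170 problems.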


gpt4turbo (preview1106)

fgenie commented 8 months ago

baseline

total: 60 / 272 (22.06 \%) fail: 0 / 272 nonconflict: 53 / 144 conflict: 7 / 128 (5.47 \%)

rims

total: 67 / 272 (24.63 \%) fail: 18 / 272 nonconflict: 53 / 144 conflict: 14 / 110 (10.94 \%, 128 as divisor)

The 18 failures stand out, but even with those counted as wrong, rims still answers more questions correctly than the baseline, which is encouraging.
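
The comparison above can be checked with a short Python sketch. The counts are copied from the comments in this issue. It assumes the `total` counts already treat failed generations as wrong, since the divisor stays at 272 in every row. The names `TOTAL`, `results`, and `gain_points` are hypothetical and only used for illustration.

```python
TOTAL = 272  # number of OCW problems in every row reported above

# Correct-answer counts copied from the comments in this issue.
results = {
    "chatgpt": {"baseline": 35, "rims": 44},
    "gpt4turbo": {"baseline": 60, "rims": 67},
}


def accuracy(correct: int) -> float:
    """Percent correct over all problems; failures stay in the divisor."""
    return round(100 * correct / TOTAL, 2)


def gain_points(model: str) -> float:
    """Percentage-point gain of rims over baseline for one model."""
    row = results[model]
    return round(100 * (row["rims"] - row["baseline"]) / TOTAL, 2)


for model, row in results.items():
    print(model, accuracy(row["baseline"]), accuracy(row["rims"]), gain_points(model))
```

Under these assumptions rims is ahead of the baseline for both models even though none of its failed generations are excluded from the denominator.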