The current rims prompt is a method that combines the two below.
What could we examine to bring these out?
As additional analyses:
It would be ideal if rims tended to do well in situations where only one particular method can solve the problem,
but cot and pal actually look applicable in almost any case. Should we design the experiment assuming that certain problem types are impossible for some methods?
Both the RIMS and the model-selection-reasoning baseline prompts built from GSM examples utterly fail on symbolic reasoning (= v3 prompt, currently the best-performing prompt): prompt_construction_src/prep_rims_prompts/gsm_prompts/3_reflectonce_p2c2cot.pal2p2c.pal2cot.txt_rm_ans
chatgpt0613long | numeric | symbolic |
---|---|---|
non-conflict | 30/82 | 0/28 |
conflict (baseline) | 6/109 | 0/53 |
conflict (rims) | 9/108 | 0/52 |
*(The 1 missing numeric and 1 missing symbolic in the rims results failed to parse.)
gpt4turbo | numeric | symbolic |
---|---|---|
non-conflict | 52/105 | 1/39 |
conflict (baseline) | 7/72 | 0/36 |
conflict (rims) | 10/76 | 0/41 |
(14 numerics and 6 symbolics missing in baseline; the 20 missing entries failed to parse.) (10 numerics and 1 symbolic missing in rims; the 11 missing entries failed to parse.)
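As a reading aid, here is a minimal sketch of how the non-conflict / conflict split in these tables is presumed to work; the exact agreement criterion is an assumption (non-conflict meaning at least two methods agree on the parsed answer), not something confirmed by the repo.

```python
# Minimal sketch (assumption): "non-conflict" = at least two of cot/pal/p2c
# return the same parsed answer, which majority vote then takes; otherwise the
# question is routed to the selection stage (model-selection baseline or RIMS).
from collections import Counter

def route_question(cot_ans: str, pal_ans: str, p2c_ans: str):
    answer, count = Counter([cot_ans, pal_ans, p2c_ans]).most_common(1)[0]
    if count >= 2:
        return "non-conflict", answer   # majority vote decides
    return "conflict", None             # defer to baseline selection / rims
```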
OCW answer types: numeric 191, expression (tex) 69, equation 12
In short, the type-wise analysis shows
RIMS v3 >= model-selection-baseline ~ PAL, P2C > CoT
but in aggregate,
RIMS v3 >= model-selection-baseline > PAL, P2C, CoT
significantly improves results in
has no effect in
elsewhere, RIMS v3 somewhat improves the performance over baselines
No single method surpasses the feedback method, while for some types PAL or P2C surpasses the model-selection baseline.
MATH-types | method | in-total perf. (cot;pal;p2c) (method+majority_vote) | % | conflict (method) | % | non-conflict (majority_vote) | % |
---|---|---|---|---|---|---|---|
Geometry | baseline | 102 (91;79;81) / 479 | 21.3% (19.0;16.5;16.9 %) | 28 / 275 | 10.2% | 74 / 544 | 13.6% |
rims | 102 / 479 | 21.3% | 28 / 275 | 10.2%* | |||
Number Theory | baseline | 279 (127;303;294) / 540 | 51.7% (23.5;56.1;54.4 %) | 10 / 208 | 4.8% | 269 / 544 | 49.4% |
rims | 314 / 540 | 58.1% | 45 / 208 | 21.6% | |||
Prealgebra | baseline | 505 (369;473;463) / 871 | 58.0% (42.4;54.3;53.2 %) | 35 / 247 | 14.2% | 470 / 544 | 86.4% |
rims | 529 / 871 | 60.7% | 59 / 247 | 23.9% | |||
Algebra | baseline | 585 (423;508;551) / 1187 | 49.3% (35.6;42.7;46.4 %) | 78 / 481 | 16.2% | 507 / 544 | 93.2% |
rims | 606 / 1187 | 51.1% | 99 / 481 | 20.6% | |||
Counting & Probability | baseline | 169 (106;184;163) / 474 | 35.7% (22.4;38.8;34.4 %) | 13 / 223 | 5.8% | 156 / 544 | 28.7% |
rims | 182 / 474 | 38.4% | 26 / 223 | 11.7% | |||
Intermediate Algebra | baseline | 137 (81;116;129) / 901 | 15.2% (9.0;12.8;14.3 %) | 27 / 517 | 5.2% | 110 / 544 | 20.2% |
rims | 143 / 901 | 15.9% | 33 / 516 | 6.4%* | |||
Precalculus | baseline | 54 (35;41;45) / 544 | 9.9% (6.4;7.5;8.3 %) | 11 / 309 | 3.6% | 43 / 544 | 7.9% |
rims | 56 / 544 | 10.3% | 13 / 309 | 4.2%* |
/math_full_0613long/ | accuracy |
---|---|
baseline | 1831/4996 (36.65%) |
rims | 1932/4996 (38.67%) |
rims_abl | 1769/4996 (35.41%) |
cot | 1232/4996 (24.7%) |
pal | 1704/4996 (34.1%) |
p2c | 1726/4996 (34.5%) |
Level | method | Conflict (method) | Non-Conflict (majority_vote) | In-Total (method+majority_vote) | COT | PAL | P2C |
---|---|---|---|---|---|---|---|
1 | Baseline | 14/90 (15.6%) | 281/1324 (21.2%) | 295/437 (67.5%) | 229/437 (52.4%) | 269/437 (61.6%) | 285/437 (65.2%) |
1 | Rims | 20/90 (22.2%) | - | 301/437 (68.9%) | - | - | - |
2 | Baseline | 38/292 (13.0%) | 439/1324 (33.2%) | 477/893 (53.4%) | 344/893 (38.5%) | 436/893 (48.8%) | 456/893 (51.1%) |
2 | Rims | 59/292 (20.2%) | - | 498/893 (55.8%) | - | - | - |
3 | Baseline | 42/483 (8.7%) | 425/1324 (32.1%) | 467/1128 (41.4%) | 300/1128 (26.6%) | 454/1128 (40.2%) | 438/1128 (38.8%) |
3 | Rims | 87/483 (18.0%) | - | 512/1128 (45.4%) | - | - | - |
4 | Baseline | 54/591 (9.1%) | 321/1324 (24.2%) | 375/1214 (30.9%) | 235/1214 (19.4%) | 339/1214 (27.9%) | 344/1214 (28.3%) |
4 | Rims | 76/591 (12.9%) | - | 397/1214 (32.7%) | - | - | - |
5 | Baseline | 54/804 (6.7%) | 163/1324 (12.3%) | 217/1324 (16.4%) | 124/1324 (9.4%) | 206/1324 (15.6%) | 203/1324 (15.3%) |
5 | Rims | 61/803 (7.6%) | - | 224/1324 (16.9%) | - | - | - |
This does not seem to be explained by the composition of types per level.
 | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
---|---|---|---|---|---|
Algebra | 135 (30.89%) | 201 (22.51%) | 261 (23.14%) | 283 (23.31%) | 307 (23.19%) |
Counting & Probability | 39 (8.92%) | 101 (11.31%) | 100 (8.87%) | 111 (9.14%) | 123 (9.29%) |
Geometry | 38 (8.7%) | 82 (9.18%) | 102 (9.04%) | 125 (10.3%) | 132 (9.97%) |
Intermediate Algebra | 52 (11.9%) | 128 (14.33%) | 193 (17.11%) | 248 (20.43%) | 280 (21.15%) |
Number Theory | 30 (6.86%) | 92 (10.3%) | 122 (10.82%) | 142 (11.7%) | 154 (11.63%) |
Prealgebra | 86 (19.68%) | 177 (19.82%) | 224 (19.86%) | 191 (15.73%) | 193 (14.58%) |
Precalculus | 57 (13.04%) | 112 (12.54%) | 126 (11.17%) | 114 (9.39%) | 135 (10.2%) |
I need to rethink the interpretation of how much the methods overlap; I will share again after revising.
For each MATH problem type, I counted how many problems each method solves and what fraction of those only that method can solve. Conclusion: the higher the accuracy on a type, the lower the uniqueness.
# Interpretation & plan
So the current cot, pal, p2c (with gsm examples) are not completely equivalent to each other,
and if we treat the roughly 10% accuracy gaps between methods as each method's advantage, we can count how often each method gets selected.
If we want to make the differences between methods stand out more, the following actionable options exist.
MATH = 4996 rows
 | selection effect | feedback_effect | in total |
---|---|---|---|
model_selection | 202 (4.0 %p) | 0 | 202 |
rims | 95 (1.9 %p) | 208 (4.2 %p) | 303 (6.1 %p) |
upperbound | 722 | - | 722 (14.5 %p) |
The OCW evaluation looks wrong.
`is_equiv_ocw` cannot parse and check the answers against the provided prompt's answers. Why? Flawed parsing and equivalence logic (surprisingly, from the original authors' code).
OCW results re-measured with `is_equiv_ocw` modified to `normalize_symbolic_exp` + `is_equiv_exp` (previously `normalize_final_answer` + `is_equiv_tex`):
cot: 27 / 272 (9.9%)
pal: 38 / 272 (14.0%)
p2c: 36 / 272 (13.2%)
Mostly the same results... this alone does not explain anything about my modification to the original eval code. I should check each configuration separately.
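For orientation, below is a minimal sketch of the kind of symbolic equivalence check `is_equiv_exp` is expected to perform on already-parsed sympy expressions; the function name `exprs_equivalent` is illustrative, not the repo's actual implementation.

```python
# Minimal sketch (assumption): two symbolic answers count as equal when their
# difference simplifies to zero; fall back to a structural comparison if
# simplification fails.
import sympy as sp

def exprs_equivalent(pred: sp.Expr, gold: sp.Expr) -> bool:
    try:
        return sp.simplify(pred - gold) == 0
    except Exception:
        return sp.srepr(pred) == sp.srepr(gold)

x = sp.Symbol("x")
print(exprs_equivalent(sp.sin(x) ** 2 + sp.cos(x) ** 2, sp.Integer(1)))  # True
```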
Yesterday I checked the Azure endpoint with Taehyung, and I also tested the OCW scoring function after fixing it.
// eval_new correct (3)
{"answer": "x_{0} \\cos (\\omega t)+$ $\\dot{x}_{0} \\sin (\\omega t) / \\omega", "artificial_wrong": "1+x_{0} \\cos (\\omega t)+$ $\\dot{x}_{0} \\sin (\\omega t) / \\omega", "eval": true, "eval_new": false}
{"answer": "\\frac{1}{b-a}\\left(e^{-a t}-e^{-b t}\\right)", "artificial_wrong": "1+\\frac{1}{b-a}\\left(e^{-a t}-e^{-b t}\\right)", "eval": "EVAL_FAIL! cannot determine truth value of Relational", "eval_new": false}
{"answer": "m_{p} c^{2}\\left(\\gamma^{2}-1\\right) \\sin ^{2} \\theta", "artificial_wrong": "1+m_{p} c^{2}\\left(\\gamma^{2}-1\\right) \\sin ^{2} \\theta", "eval": "EVAL_FAIL! cannot determine truth value of Relational", "eval_new": false}
// both evaluators wrong (42)
{"answer": "4.5e33", "artificial_wrong": "1+4.5e33", "eval": true, "eval_new": true}
{"answer": "3.83e35", "artificial_wrong": "1+3.83e35", "eval": true, "eval_new": true}
{"answer": "8.7e8", "artificial_wrong": "1+8.7e8", "eval": true, "eval_new": true}
{"answer": "4e33", "artificial_wrong": "1+4e33", "eval": true, "eval_new": true}
{"answer": "3.3e12", "artificial_wrong": "1+3.3e12", "eval": true, "eval_new": true}
{"answer": "3e6", "artificial_wrong": "1+3e6", "eval": true, "eval_new": true}
{"answer": "7e37", "artificial_wrong": "1+7e37", "eval": true, "eval_new": true}
{"answer": "7.5e7", "artificial_wrong": "1+7.5e7", "eval": true, "eval_new": true}
{"answer": "2e27", "artificial_wrong": "1+2e27", "eval": true, "eval_new": true}
{"answer": "2.75e11", "artificial_wrong": "1+2.75e11", "eval": true, "eval_new": true}
{"answer": "6e13", "artificial_wrong": "1+6e13", "eval": true, "eval_new": true}
{"answer": "4.4e7", "artificial_wrong": "1+4.4e7", "eval": true, "eval_new": true}
{"answer": "3e8", "artificial_wrong": "1+3e8", "eval": true, "eval_new": true}
{"answer": "1e11", "artificial_wrong": "1+1e11", "eval": true, "eval_new": true}
{"answer": "400000", "artificial_wrong": "1+400000", "eval": true, "eval_new": true}
{"answer": "5.47e5", "artificial_wrong": "1+5.47e5", "eval": true, "eval_new": true}
{"answer": "2.19e6", "artificial_wrong": "1+2.19e6", "eval": true, "eval_new": true}
{"answer": "1.87e6", "artificial_wrong": "1+1.87e6", "eval": true, "eval_new": true}
{"answer": "4.45e15", "artificial_wrong": "1+4.45e15", "eval": true, "eval_new": true}
{"answer": "9e11", "artificial_wrong": "1+9e11", "eval": true, "eval_new": true}
{"answer": "7.353e14", "artificial_wrong": "1+7.353e14", "eval": true, "eval_new": true}
{"answer": "1.39e9", "artificial_wrong": "1+1.39e9", "eval": true, "eval_new": true}
{"answer": "9.35e5", "artificial_wrong": "1+9.35e5", "eval": true, "eval_new": true}
{"answer": "2.88e16", "artificial_wrong": "1+2.88e16", "eval": true, "eval_new": true}
{"answer": "7.26e6", "artificial_wrong": "1+7.26e6", "eval": true, "eval_new": true}
{"answer": "1.85e5", "artificial_wrong": "1+1.85e5", "eval": true, "eval_new": true}
{"answer": "4.46e19", "artificial_wrong": "1+4.46e19", "eval": true, "eval_new": true}
{"answer": "3.21e13", "artificial_wrong": "1+3.21e13", "eval": true, "eval_new": true}
{"answer": "2.45e6", "artificial_wrong": "1+2.45e6", "eval": true, "eval_new": true}
{"answer": "7.02e5", "artificial_wrong": "1+7.02e5", "eval": true, "eval_new": true}
{"answer": "3.75e9", "artificial_wrong": "1+3.75e9", "eval": true, "eval_new": true}
{"answer": "1.07e16", "artificial_wrong": "1+1.07e16", "eval": true, "eval_new": true}
others the same (272 - 42 - 3 = 227)
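A minimal sketch of the sanity check apparently behind the records above, assuming the procedure is simply prepending "1+" to each gold answer; `evaluator` stands in for either version of `is_equiv_ocw`.

```python
# Minimal sketch (assumption): make each gold answer deliberately wrong by
# prepending "1+", then check whether the evaluator still accepts it.
# A True in the "eval" field is a false accept by the grader.
def artificial_wrong_records(gold_answers, evaluator):
    records = []
    for ans in gold_answers:
        wrong = "1+" + ans
        records.append({
            "answer": ans,
            "artificial_wrong": wrong,
            "eval": evaluator(wrong, ans),  # should be False for a sound grader
        })
    return records
```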
The OCW parsing / evaluation issue was resolved with the following merge: https://github.com/fgenie/rims_minimal/pull/40#issue-2186864593
After implementing `utils.llm_query_utils.extract_ans_from_cot_MATHnOCW` and re-scoring, the following changes were observed on the CoT solutions obtained with the GSM fewshot. Experiments will use this parsing function from now on. After touching up `utils.math_util.is_equiv` a bit more (reinforced by adding a string exact match at the very front), the grader became more trustworthy. (update)

Metric | Math | OCW |
---|---|---|
old_acc | 0.247 | 0.099 |
new_acc | 0.274 (+2.7%p) | 0.195 (+9.6%p) |
delta correct | +138 / 4996 | +26 / 272 |
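A minimal sketch of the "string exact match at the very front" reinforcement described above, assuming it simply wraps the existing logic; `core_is_equiv` stands in for the current `utils.math_util.is_equiv` TeX/symbolic comparison.

```python
# Minimal sketch (assumption): try a cheap normalized string comparison first;
# only fall back to the heavier symbolic/TeX equivalence check on a miss.
def is_equiv(pred: str, gold: str, core_is_equiv) -> bool:
    normalize = lambda s: s.strip().replace(" ", "")
    if normalize(pred) == normalize(gold):
        return True                      # exact-match shortcut
    return core_is_equiv(pred, gold)     # symbolic/TeX equivalence fallback
```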
Confirmed that the `sp.latex(code_return)` handling does not affect arithmetic results such as GSM, while doing the processing needed for OCW and MATH:
- GSM: no net change / 6 None's
- OCW: +24 net change / 82 rows changed out of 272 / 58 None's
- MATH: +147 net change / 1318 rows changed out of 4996 / 1172 None's
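A minimal sketch of that normalization, assuming the intent is just to render sympy objects returned by PAL/P2C code into TeX strings before comparison; `normalize_code_return` is a hypothetical helper, not the repo's function.

```python
# Minimal sketch (assumption): numeric code returns (GSM/arithmetic) pass
# through unchanged; sympy objects (common for OCW/MATH) are rendered to TeX
# so they can be compared against TeX-formatted gold answers.
import sympy as sp

def normalize_code_return(code_return):
    if code_return is None:
        return None
    if isinstance(code_return, (int, float)):
        return str(code_return)
    if isinstance(code_return, sp.Basic):
        return sp.latex(code_return)
    return str(code_return)

print(normalize_code_return(42))              # "42"
print(normalize_code_return(sp.sqrt(2) / 2))  # "\frac{\sqrt{2}}{2}"
```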
Progress is tracked here... only a few checkboxes remain on the working branch's todolist.
The previous main results below (using gsm prompt v3), re-scored with the updated evaluation code: https://github.com/fgenie/rims_minimal/issues/35#issue-2123202439
eval fix effect only | model_selection | rims |
---|---|---|
gsm | 1087/1319 (-) | 1114/1319 (-) |
ocw | 36/272 (-) | 39/272 (-) |
math | 1832/4996 (+1) | 1936/4996 (+4) |
OVERLAPS + Individual performance (chatgpt0613long)
prompt = GSM_OLD_BEST (only gsm fewshots)
dataset = gsm (total 1319)
{'all': 829,
'cot_only': 62,
'p2c_only': 32,
'pal_only': 67,
'cotpal-p2c': 85,
'p2ccot-pal': 54,
'palp2c-cot': 73}
single perf =
'cot': 1030,
'p2c': 988,
'pal': 1054,
dataset = math (total 5000)
{'all': 657,
'cot_only': 283,
'p2c_only': 322,
'pal_only': 317,
'p2ccot-pal': 158,
'cotpal-p2c': 134,
'palp2c-cot': 610}
'cot': 1232,
'pal': 1718,
'p2c': 1747,
dataset = ocw (total 272)
{'all': 6,
'cot_only': 13,
'pal_only': 14,
'p2c_only': 10,
'cotpal-p2c': 3,
'p2ccot-pal': 5,
'palp2c-cot': 15}
'cot': 27,
'pal': 38,
'p2c': 36,
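A minimal sketch of how the overlap categories above are presumed to be counted, given per-method sets of correctly solved question ids (the function name is illustrative).

```python
# Minimal sketch (assumption): Venn-style overlap counts over per-method
# sets of correctly answered question ids.
def overlap_counts(cot: set, pal: set, p2c: set) -> dict:
    return {
        "all": len(cot & pal & p2c),
        "cot_only": len(cot - pal - p2c),
        "pal_only": len(pal - cot - p2c),
        "p2c_only": len(p2c - cot - pal),
        "cotpal-p2c": len((cot & pal) - p2c),
        "p2ccot-pal": len((p2c & cot) - pal),
        "palp2c-cot": len((pal & p2c) - cot),
    }
```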
https://llm4a.slack.com/archives/C05FKA9C85P/p1711298901583759
Preview:
The MATH results I said I would bring over are below. I will post the remaining results as they are consolidated, organized from the following perspectives:
- What changed compared to the previous results (you can preview this in the math-baseline results below)
- How differently each method (cot, pal, p2c) behaves in terms of the final answer
- Whether rims > baseline
- Whether rims > ablations (-hint / -hint-mistakes / -hint-mistakes-1st attempt)

I found and fixed a small bug in gsm CoT; gsm CoT now produces proper scores.
Scoring TeX-formatted answers still has a very low success rate. That was already true of the original minerva code, and even after the fixes it is only slightly better.
Additionally, I confirmed that rims prompting does not need a long context beyond 4k.
Last time, max_token was set excessively high during the experiment, which forced us onto the 16k model, but input+output actually fits within a 4k context.
However, for comparability with the previous results, this experiment uses the same gpt-3.5-turbo-0613-16k. Future experiments do not need to.
math baseline result (model = chatgpt0613long, temperature = 0)
Individual performance: cot 1527 / 4999 (30.5%) (former: 24.7%), pal 2047 / 4999 (40.9%) (former: 34.1%), p2c 1758 / 4999 (35.2%) (former: 34.5%)
Overall performance (model-selection-reasoning): overall_acc 2120 / 4999 (42.4%), success_rate 354 / 2438 (14.5%); of the 4999 total, 2438 went to selection.
"former" = the last results run with the GSM fewshot. For comparability, the following were kept fixed: model = chatgpt 0613 long, temperature. cot/pal/p2c performance and the model-selection baseline each went up by about 5%p. The contributing factors:
- cot, pal fewshot: gsm-8shots --> minerva-math-4shots
- p2c fewshot: gsm-8shots --> MBPP 8 shots (= the original plan2code paper prompt). The prompt format also changed, so the planning step is sometimes not explicitly visible in the LLM-generated answer, but this matches what the original p2c paper does.
- the MATH parsing function (affects the cot results)
- the MATH evaluation function (affects the final scores and the majority-voting step)
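A minimal sketch of how overall_acc and success_rate above are presumed to be computed (success_rate counting only questions routed to selection, e.g. 354 of the 2438 conflict cases here); the record layout is an assumption.

```python
# Minimal sketch (assumption): overall accuracy over all questions vs.
# success rate over the conflict subset that went to selection.
def summarize(results):
    # results: list of dicts with boolean fields "went_to_selection" and "correct"
    selected = [r for r in results if r["went_to_selection"]]
    overall_acc = sum(r["correct"] for r in results) / len(results)
    success_rate = sum(r["correct"] for r in selected) / len(selected)
    return overall_acc, success_rate
```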
gsm (+4.5%p, +59/1319) and math (+2.3%p, +115/5000) satisfy rims > simple greedy, but ocw (-1.4%p, -4/272) comes out as rims ~< simple greedy. If we try a few more prompts, one of them should probably turn out fine; the OCW sample is small.
*Of the two math prompts, only one satisfies the above; for gsm I only tried one and it worked; for ocw I have tried only two so far.
chatgpt 1106
Distinct | GSM | OCW | Math |
---|---|---|---|
cot_only | 57 (4.77%) | 31 (39.74%) | 536 (17.81%) |
pal_only | 53 (4.44%) | 5 (6.41%) | 445 (14.79%) |
p2c_only | 44 (3.69%) | 19 (24.36%) | 299 (9.94%) |
prompt | Overall Accuracy | Success Rate |
---|---|---|
simple greedy | 44/272 (16.2%) | 11/187 (5.9%) |
simple greedy + SC@5 (cotT0.5, palT0.8) | 57 / 272 | 44 / 222 |
----------------------------------------------------------------- | ------------------ | -------------- |
rims_gsm_old | 55 / 272 (20.2%) | 14 / 187 (7.5%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims (p2c-cot.pal-p2c.cot-p2c) | 63 / 272 (23.2%) | 22 / 155 (14.2%) |
SC@5 T=0.7 | 60 / 272 | 47 / 222 (fail=1) |
SC@5 T=0.5 | 73 / 264 | 60 / 222 |
SC@5 T=0.2 | 66 / 264 | 53 / 222 |
SC@10 T=0.5 | 82 / 249 | 75 / 227 |
SC@10 T=0.2 | 85 / 249 | 78 / 227 |
-hint | 61 / 272 (22.4%) | 20 / 155 (12.9%) |
-hint-mistakes | 60 / 272 (22.1%) | 19 / 155 (12.3%) |
-hint-mistakes-attempt1 | 54 / 272 (19.9%) | 13 / 155 (8.4%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims' (p2c-cot.pal-p2c.pal-cot) | 57 / 272 (21.0%) | 16 / 187 (8.6%) |
SC@5 T=0.7 | 56 / 272 | 43 / 222 (19.4%) |
SC@5 T=0.5 | 63 / 264 | 50 / 222 |
SC@5 T=0.2 | 65 / 264 | 52 / 222 |
SC@10 T=0.5 | 79 / 249 | 72 / 227 |
SC@10 T=0.2 | 77 / 249 | 70 / 227 (1 fail) |
-hint | 52 / 272 (19.1%) | 11 / 187 (5.9%) |
-hint-mistakes | 56 / 272 (20.6%) | 15 / 187 (8.0%) |
-hint-mistakes-attempt1 | 51 / 272 (18.8%) | 10 / 187 (5.3%) |
----------------------------------------------------------------- | ------------------ | -------------- |
cot | 49 / 272 (18.0%) | - |
pal | 17 / 272 (6.2%) | - |
p2c | 37 / 272 (13.6%) | - |
prompt | Overall Accuracy | Success Rate |
---|---|---|
simple greedy | 1081 / 1319 (82.0%) | 43 / 196 (21.9%) |
SC@15 | 1126 / 1297 ( 22 api errors ) | 261 / 413 (63.2%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims_gsm_old | 1127 / 1319 (85.4%) | 86 / 193 (44.6%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims | 1122 / 1319 (85.1%) | 81 / 193 (42.0%) |
-hint | 1131 / 1319 (85.7%) | 90 / 193 (46.6%) |
-hint-mistakes | 1122 / 1319 (85.1%) | 81 / 193 (42.0%) |
-hint-mistakes-attempt1 | 1103 / 1319 (83.6%) | 62 / 193 (32.1%) |
+p2c_rewrote (GSM_RIMS) | 1127 / 1319 (85.4%) | 86 / 193 (44.6%) |
SC@15 T=0.2 | 1151 / 1288 (+9 fails) | 286 / 404 (70.8%) |
SC@15 T=0.5 | 1153 / 1285 (+12 fails) | 288 / 401 (71.8%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims'+p2c_rewrote (cot2p2c.pal2cot.pal2p2c) (GSM_RIMS1) | 1103 / 1319 (83.6%) | 62 / 193 (32.1%) |
SC@15 T=0.2 | 1150 / 1296 (+1 fails) | 285 / 412 (69.2%) |
SC@15 T=0.5 | 1155 / 1285 (+12 fails) | 290 / 401 (72.3%) |
rims''+p2c_rewrote (pal2p2c.cot2p2c.cot2pal) (GSM_RIMS2) | 1113 / 1319 (84.4%) | 72 / 193 (37.3%) |
SC@15 T=0.2 | 1143 / 1292 (+5 fails) | 278 / 408 (68.1%) |
SC@15 T=0.5 | 1143 / 1292 | 278 / 408 (68.1%) |
----------------------------------------------------------------- | ------------------ | -------------- |
cot | 921 / 1319 (69.8%) | - |
pal | 1038 / 1319 (78.7%) | - |
p2c | 991 / 1319 (75.1%) | - |
prompt | Overall Accuracy | Success Rate |
---|---|---|
simple greedy | 2086 / 4999 (41.7%) | 361 / 2550 (14.2%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims_gsm_old | 2192 / 4999 (43.8%) | 392 / 2361 (16.6%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims (p2c-cot.pal-p2c.pal-cot) | 2188 / 4999 (43.8%) | 388 / 2361 (16.4%) |
-hint | 2218 / 4999 (44.4%) | 418 / 2361 (17.7%) |
-hint-mistakes | 2170 / 4999 (43.4%) | 416 / 2503 (16.6%) |
-hint-mistakes-attempt1 | 2151 / 4999 (43.0%) | 351 / 2361 (14.9%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims (1) | 2191 / 4999 (43.8%) | 391 / 2361 (16.6%) |
-hint | 2166 / 4999 (43.3%) | 366 / 2361 (15.5%) |
-hint-mistakes | 2177 / 4999 (43.5%) | 377 / 2361 (16.0%) |
-hint-mistakes-attempt1 | 2137 / 4999 (42.7%) | 382 / 2500 (15.3%) |
----------------------------------------------------------------- | ------------------ | -------------- |
cot | 1644 / 4999 (32.9%) | |
pal | 1900 / 4999 (38.0%) | |
p2c | 1796 / 4999 (35.9%) |
 | cot_only | pal_only | p2c_only |
---|---|---|---|
math (5000) | 557 | 104 | 977 |
ocw (272) | 28 | 5 | 42 |
gsm (1319) | 15 | 13 | 10 |
satisfactory result
prompt | Overall Accuracy | Success Rate (selection max: 38/41 (92.7%)) |
---|---|---|
simple greedy | 1249 / 1319 (94.7%) | 13 / 41 (31.7%) |
rims_gsm_old | 1262 / 1319 (95.7%) | 23 / 31 (56.1%) |
rims_gsm_newer (remove p2c plan from above) | 1259 / 1319 (95.5%) | 20 / 41 (48.8%) |
rims* (p2c2cot.pal2p2c.pal2cot) | 1260 / 1319 (95.5%) | 21 / 41 (51.2%) |
rims* (pal2p2c.cot2p2c.cot2pal) | 1256 / 1319 (95.2%) | 17 / 41 (41.5%) |
rims* (cot2p2c.pal2cot.pal2p2c) | 1259 / 1319 (95.5%) | 20 / 41 (48.8%) |
cot | 1110 / 1319 (84.2%) | |
pal | 1239 / 1319 (93.9%) | |
p2c | 1226 / 1319 (92.9%) |
*Those are for unifying the reformatted p2c format of MATH and ocw_courses.
satisfactory result
prompt | Overall Accuracy | Success Rate (selection max: 1638/4999 (64.5%)) |
---|---|---|
simple greedy | 2126 / 4999 (42.5%) | 401 / 2539 (15.8%) |
rims_gsm_old | 2539 / 4999 (50.8%) | 814 / 2539 (32.1%) |
rims (p2c-cot.pal-p2c.pal-cot) | 2584 / 4999 (51.7%) | 859 / 2539 (33.8%) |
rims (p2c-cot.pal-p2c.pal-cot) (1) | 2597 / 4999 (52.0%) | 872 / 2539 (34.3%) |
cot | 1828 / 4999 (36.6%) | |
pal | 741 / 4999 (14.8%) | |
p2c | 2468 / 4999 (49.4%) |
*(1) has a different question in its fewshot blurb
unsatisfying...🧐
prompt | Overall Accuracy | Success Rate (selection max: 85/157 (54.1%)) |
---|---|---|
simple greedy | 69 / 272 (25.4%) | 16 / 157 (10.2%) |
rims_gsm_old | 79 / 272 (29.0%) | 26 / 157 (16.6%) |
rims (p2c-cot.pal-p2c.pal-cot) | 74 / 272 (27.2%) | 21 / 157 (13.4%) |
rims (p2c-cot.pal-p2c.cot-p2c) | 67 / 272 (24.6%) | 14 / 157 (8.9%) |
cot | 61 / 272 (22.4%) | |
pal | 23 / 272 (8.5%) | |
p2c | 78 / 272 (28.7%) |
The following applies to all prompts in the gsm exp: `rims` and `simple-greedy` prompts (the last one is what is finally used).
For MATH and OCW, p2c plans sometimes appear implicitly and sometimes explicitly (even though the prompts that generated them were all explicit!):
```python
# p2c in gsm_old
{NUMBERED LIST}  # plan
{CODE}           # code

# p2c in gsm_newer (that is, "remove plan" above in the gpt4 table)
def solution():
    """ docstring usually dropped the plan given in the prompt """
    {CODE}  # but the code includes kind-of-numbered comments

# p2c in rims* above in the gpt4 table
def solution():
    """
    question and some explanations
    {NUMBERED_LIST_PLAN}
    """
    {CODE}
```
For more, see the prompts below:
The tables below break down the `rims`-correct cases, that is, cases where the individual methods conflict: `selection_effect` if rims answered with the same method that originally answered correctly, `reflection_effect` otherwise. For example, if `cot` was originally correct but rims answers with `pal` and is evaluated correct, it counts as correct by `reflection_effect`; if `pal` was originally correct and rims also answers with `pal` and is evaluated correct, it counts as `selection_effect`. `selection_effect` further breaks down into `select_cot|pal|p2c`.

 | reflection_effect | selection_effect | select_p2c | select_pal | select_cot |
---|---|---|---|---|---|
simple greedy | 0.0 % | 100.0 % | 29.4 % | 8.0 % | 62.6 % |
rims_gsm_old (p2c2cot.pal2p2c.pal2cot) | 90.5 % | 9.5 % | 0.6 % | 0.1 % | 8.7 % |
rims* (p2c-cot.pal-p2c.pal-cot) (1) | 85.1 % | 14.9 % | 5.2 % | 0.2 % | 9.5 % |
rims* (p2c-cot.pal-p2c.pal-cot) | 85.6 % | 14.4 % | 5.6 % | 0.1 % | 8.7 % |
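A minimal sketch of that categorization, assuming the split only depends on whether the method rims finally answered with was already correct on its own; `classify_rims_correct` and `originally_correct` are illustrative names.

```python
# Minimal sketch (assumption): classify a rims-correct conflict case.
# `originally_correct` maps each method name to whether its standalone
# answer was correct before rims ran.
def classify_rims_correct(rims_method: str, originally_correct: dict) -> str:
    if originally_correct.get(rims_method, False):
        # rims picked a method that was already correct on its own
        return f"selection_effect (select_{rims_method})"
    # the correct answer came out of the reflection/rewriting step instead
    return "reflection_effect"

# e.g. cot was originally correct, rims answered correctly via pal:
print(classify_rims_correct("pal", {"cot": True, "pal": False, "p2c": False}))
# -> reflection_effect
```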
 | reflection_effect | selection_effect | select_p2c | select_pal | select_cot |
---|---|---|---|---|---|
simple greedy | 0.0% | 100.0% | 7.7% | 38.5% | 53.8% |
rims_gsm_old (p2c2cot.pal2p2c.pal2cot) | 60.9% | 39.1% | 0.0% | 0.0% | 39.1% |
rims* (p2c2cot.pal2p2c.pal2cot) | 66.7% | 33.3% | 0.0% | 0.0% | 33.3% |
rims* (pal2p2c.cot2p2c.cot2pal) | 64.7% | 35.3% | 0.0% | 0.0% | 35.3% |
rims* (cot2p2c.pal2cot.pal2p2c) | 55.0% | 45.0% | 10.0% | 0.0% | 35.0% |
 | reflection_effect | selection_effect | select_p2c | select_pal | select_cot |
---|---|---|---|---|---|
simple greedy | 0.0% | 100.0% | 18.8% | 0.0% | 81.2% |
rims_gsm_old (p2c2cot.pal2p2c.pal2cot) | 96.2% | 3.8% | 0.0% | 0.0% | 3.8% |
rims* (p2c2cot.pal2p2c.pal2cot) | 90.5% | 9.5% | 9.5% | 0.0% | 0.0% |
rims* (p2c2cot.pal2p2c.cot2p2c) | 71.4% | 28.6% | 28.6% | 0.0% | 0.0% |
 | reflection_effect | selection_effect | select_p2c | select_pal | select_cot |
---|---|---|---|---|---|
simple greedy | 0.0 % | 100.0 % | 3.0 % | 13.3 % | 83.7 % |
rims_gsm_old | 67.1 % | 32.9 % | 5.6 % | 17.6 % | 9.7 % |
rims* p2c-cot.pal-p2c.pal-cot | 75.5 % | 24.5 % | 7.0 % | 2.8 % | 14.7 % |
rims* p2c-cot.pal-p2c.pal-cot (1) | 75.2 % | 24.8 % | 7.9 % | 5.4 % | 11.5 % |
rims* p2c-cot.pal-p2c.pal-cot-hint | 75.6 % | 24.4 % | 6.9 % | 2.4 % | 15.1 % |
rims* p2c-cot.pal-p2c.pal-cot-hint (1) | 77.6 % | 22.4 % | 8.5 % | 0.8 % | 13.1 % |
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes | 27.2 % | 72.8 % | 5.5 % | 9.1 % | 58.2 % |
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes (1) | 70.6 % | 29.4 % | 7.2 % | 5.8 % | 16.4 % |
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes-attempt1 | 70.7 % | 29.3 % | 6.6 % | 14.0 % | 8.8 % |
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes-attempt1 (1) | 30.1 % | 69.9 % | 5.0 % | 12.3 % | 52.6 % |
 | reflection_effect | selection_effect | select_p2c | select_pal | select_cot |
---|---|---|---|---|---|
simple greedy | 0.0 % | 100.0 % | 2.3 % | 2.3 % | 95.3 % |
rims_gsm_old | 79.1 % | 20.9 % | 3.5 % | 1.2 % | 16.3 % |
rims_gsm_newer | 69.1 % | 30.9 % | 1.2 % | 2.5 % | 27.2 % |
rims_gsm_newer-hint | 73.3 % | 26.7 % | 0.0 % | 0.0 % | 26.7 % |
rims_gsm_newer-hint-mistakes | 74.1 % | 25.9 % | 0.0 % | 2.5 % | 23.5 % |
rims_gsm_newer-hint-mistakes-attempt1 | 62.9 % | 37.1 % | 1.6 % | 32.3 % | 3.2 % |
rims* p2c2cot.pal2p2c.pal2cot | 69.8 % | 30.2 % | 9.3 % | 2.3 % | 18.6 % |
rims* pal2p2c.cot2p2c.cot2pal | 82.3 % | 17.7 % | 3.2 % | 12.9 % | 1.6 % |
rims* cot2p2c.pal2cot.pal2p2c | 70.8 % | 29.2 % | 8.3 % | 2.8 % | 18.1 % |
 | reflection_effect | selection_effect | select_p2c | select_pal | select_cot |
---|---|---|---|---|---|
simple greedy | 0.0 % | 100.0 % | 18.2 % | 0.0 % | 81.8 % |
rims_gsm_old | 75.0 % | 25.0 % | 12.5 % | 6.2 % | 6.2 % |
rims* p2c-cot.pal-p2c.cot-p2c | 59.1 % | 40.9 % | 31.8 % | 0.0 % | 9.1 % |
rims* p2c-cot.pal-p2c.pal-cot | 73.3 % | 26.7 % | 6.7 % | 0.0 % | 20.0 % |
rims* p2c-cot.pal-p2c.cot-p2c-hint | 80.0 % | 20.0 % | 5.0 % | 0.0 % | 15.0 % |
rims* p2c-cot.pal-p2c.pal-cot-hint | 80.0 % | 20.0 % | 0.0 % | 0.0 % | 20.0 % |
rims* p2c-cot.pal-p2c.cot-p2c-hint-mistakes | 73.7 % | 26.3 % | 15.8 % | 0.0 % | 10.5 % |
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes | 86.7 % | 13.3 % | 6.7 % | 0.0 % | 6.7 % |
rims* p2c-cot.pal-p2c.cot-p2c-hint-mistakes-attempt1 | 100.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % |
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes-attempt1 | 70.0 % | 30.0 % | 20.0 % | 0.0 % | 10.0 % |
(Seungjae) It would be good to carefully work out the cost side of rims. (chul) We should apply self-consistency on top of (2), and compare performance against simple-greedy + SC on (1) (chatgpt). (chul) Run the opensource LLM experiments here if at all possible? (Kichang has no spare bandwidth.)
(will update further after math done on Tue)
gpt-3.5-turbo-1106
- `rims` and `simple-greedy` (baseline)
- SC helps `simple-greedy` (baseline)
- @ SC5, `rims` > `simple-greedy`
gpt-3.5-turbo-1106
note that rims (blabla) just denotes an example combination + ordering.
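For reference, a minimal sketch of what the SC@k rows in these tables are presumed to mean; `sample_answer` is a hypothetical sampling callback, and the tie-breaking detail is an assumption.

```python
# Minimal sketch (assumption): self-consistency = sample k answers at
# temperature T and keep the most common one. Denominators smaller than the
# full split size reflect the API failures noted in the tables.
from collections import Counter

def self_consistency(sample_answer, k: int, temperature: float):
    answers = [sample_answer(temperature) for _ in range(k)]
    answers = [a for a in answers if a is not None]  # drop failed generations
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```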
prompt | Overall Accuracy | Success Rate |
---|---|---|
simple greedy | 44/272 (16.2%) | 11/187 (5.9%) |
SC@5 (cotT0.5, palT0.8) | 57 / 272 | 44 / 222 |
SC@10 | 69 / 249 (23 failed) | 62 / 227 |
----------------------------------------------------------------- | ------------------ | -------------- |
rims (p2c-cot.pal-p2c.cot-p2c) | 63 / 272 (23.2%) | 22 / 155 (14.2%) |
SC@5 T=0.2 | 66 / 264 | 53 / 222 |
SC@5 T=0.5 | 73 / 264 | 60 / 222 |
SC@5 T=0.7 | 60 / 272 | 47 / 222 (fail=1) |
SC@10 T=0.2 | 85 / 249 | 78 / 227 |
SC@10 T=0.5 | 82 / 249 | 75 / 227 |
----------------------------------------------------------------- | ------------------ | -------------- |
rims (p2c-cot.pal-p2c.pal-cot) | 57 / 272 (21.0%) | 16 / 187 (8.6%) |
SC@5 T=0.2 | 65 / 264 | 52 / 222 |
SC@5 T=0.5 | 63 / 264 | 50 / 222 |
SC@5 T=0.7 | 56 / 272 | 43 / 222 (19.4%) |
SC@10 T=0.2 | 77 / 249 | 70 / 227 (1 fail) |
SC@10 T=0.5 | 79 / 249 | 72 / 227 |
gpt-3.5-turbo-1106
prompt | Overall Accuracy | Success Rate |
---|---|---|
simple greedy | 1081 / 1319 (82.0%) | 43 / 196 (21.9%) | |
SC@15 | 1126 / 1297 ( 22 api errors ) | 261 / 413 (63.2%) | |
----------------------------------------------------------------- | ------------------ | -------------- | |
rims (newer_best_p2c2cot.pal2p2c.pal2cot) | 1122 / 1319 (85.1%) | 81 / 193 (42.0%) | |
SC@15 T=0.2 | 1151 / 1288 (+9 fails) | 286 / 404 (70.8%) | |
SC@15 T=0.5 | 1153 / 1285 (+12 fails) | 288 / 401 (71.8%) | |
----------------------------------------------------------------- | ------------------ | -------------- | |
rims (cot2p2c.pal2cot.pal2p2c) (GSM_RIMS1) | 1103 / 1319 (83.6%) | 62 / 193 (32.1%) | |
SC@15 T=0.2 | 1150 / 1296 (+1 fails) | 285 / 412 (69.2%) | |
SC@15 T=0.5 | 1155 / 1285 (+12 fails) | 290 / 401 (72.3%) | |
rims (pal2p2c.cot2p2c.cot2pal) (GSM_RIMS2) | 1113 / 1319 (84.4%) | 72 / 193 (37.3%) | |
SC@15 T=0.2 | 1143 / 1292 (+5 fails) | 278 / 408 (68.1%) | |
SC@15 T=0.5 | 1143 / 1292 | 278 / 408 (68.1%) |
Something is going super wrong (SC@5 << T=0, n=1)
prompt | Overall Accuracy | Success Rate |
---|---|---|
simple greedy | 2086 / 4999 (41.7%) | 361 / 2550 (14.2%) |
SC@5, cotT=0.5 / palT=0.8 | 439 / 4904 (9.0%) | 318 / 3684 (8.6%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims (p2c-cot.pal-p2c.pal-cot) | 2188 / 4999 (43.8%) | 388 / 2361 (16.4%) |
SC@5, T=0.2 | 536 / 4897 (10.9%) | 415 / 3677 (11.3%) |
SC@5, T=0.5 | ||
----------------------------------------------------------------- | ------------------ | -------------- |
rims (1) | 2191 / 4999 (43.8%) | 391 / 2361 (16.6%) |
SC@5, T=0.2 | ||
SC@5, T=0.5 |
@strutive07 @fgenie chul
chul: for openllm, use the two above.
Get a cost estimate for Claude-3.5-sonnet (https://www.computerworld.com/article/2472913/anthropic-claude-3-5-sonnet-is-here-and-its-free.html)
Temperature = 0, n = 1
GSM8K | ||||||
---|---|---|---|---|---|---|
Model | Score file | cot | pal | p2c | simple greedy | rims |
Meta-Llama-3-8B-Instruct | link | 0.7301 | 0.7597 | 0.6513 | 0.817 (1078/1319) | 0.831 (1096/1319) |
Phi-3-small-128k-instruct | link | 0.8438 | 0.8635 | 0.8097 | 0.906 (1195/1319) | 0.920 (1213/1319) |
Math | ||||||
---|---|---|---|---|---|---|
Model | Score file | cot | pal | p2c | simple greedy | rims |
Meta-Llama-3-8B-Instruct | link | 0.3016 | 0.1498 | 0.2134 | 0.319 (1597/5000) | 0.320 (1601/5000) |
Phi-3-small-128k-instruct | link | 0.3684 | 0.3808 | 0.3628 | 0.462 (2308/5000) | 0.414 (2072/5000) |
OCW | ||||||
---|---|---|---|---|---|---|
Model | Score file | cot | pal | p2c | simple greedy | rims |
Meta-Llama-3-8B-Instruct | link | 0.1213 | 0.0257 | 0.0662 | 0.121 (33/272) | 0.110 (30/272) |
Phi-3-small-128k-instruct | link | 0.2684 | 0.1360 | 0.1801 | 0.199 (54/272) | 0.165 (45/272) |
...the floor performance that voting guarantees for the selection algorithm (each method must be correct, and parsing must succeed so the correct answers match each other) differs a lot between datasets. For gsm, voting alone already starts above the best of the three individual methods; the other two do not appear to.
The corresponding result files can be found at 5e0bb5f6a378d39da8d8102485350383ac6bfa60. @strutive07
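A minimal sketch of the voting-guaranteed floor described above, under the assumption that it counts questions where at least two methods are independently correct (so their parsed answers agree and majority vote alone already scores them); `voting_floor` is an illustrative name.

```python
# Minimal sketch (assumption): share of questions where >= 2 of the three
# methods are correct, i.e. what majority vote secures without any selection.
def voting_floor(cot_ok, pal_ok, p2c_ok) -> float:
    # cot_ok / pal_ok / p2c_ok: per-question lists of booleans, same length
    n = len(cot_ok)
    hits = sum(1 for c, p, q in zip(cot_ok, pal_ok, p2c_ok) if c + p + q >= 2)
    return hits / n
```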
Feb 11
Submission deadline
ACL --> NeurIPS after arXiv

OCW low perf:
struggling with symbolics
Click: symbolic v. numeric perf.
chatgpt
- nonconflict_numeric: 30/82
- nonconflict_symbolic: 0/28
- conflict_rims_numeric: 9/108
- conflict_rims_symbolic: 0/52
- conflict_base_numeric: 6/109
- conflict_base_symbol: 0/53

gpt4turbo
- nonconflict_numeric: 52/105
- nonconflict_symbolic: 1/39
- conflict_rims_numeric: 10/76
- conflict_rims_symbolic: 0/41
- conflict_base_numeric: 7/72
- conflict_base_symbol: 0/36
openLLM experiments
LLM:
New data?
(svamp saturates with gpt4)
To Report
must
total performance
selection success rate (for comparing selection methodologies)
reflection-less (ablation)
good to report