Meeting notes, Feb 11~ #37

Feb 11

Submission deadline

ACL --> arXiv, then NeurIPS

OCW low perf:

struggling with symbolics

symbolic vs. numeric perf. (chatgpt)

Open LLM experiments


New data?

(SVAMP already saturates on GPT-4)

To Report


total performance

selection success rate (for comparing selection methods)

reflection-less (ablation)

good to report

fgenie commented 4 months ago

related papers

reasoning with LLM

fgenie commented 4 months ago

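The current rims prompt is a method that combines the two below.

What could we look at to bring each of them out?

  1. Compare the selection tendencies with those of model selection.
  2. If the same reasoning method is chosen again after reflection (e.g. CoT -> CoT), how does the answer change?
  3. Or, if there are no such cases, count the cases where rims picks the same method as model selection, yet model selection is wrong and rims is right.
  4. In addition, it would be good to see how the two split between solved and unsolved cases within the conflict set.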
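The comparisons in items 3 and 4 could be tallied with a small script. This is only a minimal sketch: it assumes each conflict-set record carries two boolean fields, `baseline_correct` and `rims_correct`, which are hypothetical names chosen for illustration and not taken from the repository.

```python
from collections import Counter


def tally_conflict_outcomes(records):
    """Count the four (baseline, rims) correctness combinations.

    `records` is an iterable of dicts with the hypothetical keys
    'baseline_correct' and 'rims_correct'.
    """
    counts = Counter()
    for rec in records:
        key = (bool(rec["baseline_correct"]), bool(rec["rims_correct"]))
        counts[key] += 1
    return {
        "both_correct": counts[(True, True)],
        "only_baseline": counts[(True, False)],
        "only_rims": counts[(False, True)],
        "both_wrong": counts[(False, False)],
    }


# Toy records for illustration only (not experimental data).
toy_records = [
    {"baseline_correct": True, "rims_correct": True},
    {"baseline_correct": False, "rims_correct": True},
    {"baseline_correct": False, "rims_correct": True},
    {"baseline_correct": True, "rims_correct": False},
    {"baseline_correct": False, "rims_correct": False},
]
print(tally_conflict_outcomes(toy_records))
```

The `only_rims` cell is the count item 3 asks for; the full 2x2 split answers item 4.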

Other supplementary analyses:

fgenie commented 4 months ago

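It would be ideal if rims tended to do well in situations where only one method can solve a given problem,

but cot and pal look applicable in almost any case. Do we have to run the experiments by positing that a method is infeasible for certain problem forms?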

fgenie commented 4 months ago

Failure in symbolic problem solving (OCW)

Both the RIMS and the model-selection-reasoning baseline prompts built from GSM examples fail completely on symbolic reasoning. (= v3 prompt, currently the best-performing prompt) prompt_construction_src/prep_rims_prompts/gsm_prompts/3_reflectonce_p2c2cot.pal2p2c.pal2cot.txt_rm_ans

| chatgpt0613long | numeric | symbolic |
|---|---|---|
| non-conflict | 30/82 | 0/28 |
| conflict (baseline) | 6/109 | 0/53 |
| conflict (rims) | 9/108 | 0/52 |

*(1 numeric and 1 symbolic row are missing from the rims results because they failed to parse)

| gpt4turbo | numeric | symbolic |
|---|---|---|
| non-conflict | 52/105 | 1/39 |
| conflict (baseline) | 7/72 | 0/36 |
| conflict (rims) | 10/76 | 0/41 |

(The baseline is missing 14 numeric and 6 symbolic rows: 20 rows failed to parse. rims is missing 10 numeric and 1 symbolic row: 11 rows failed to parse.)

OCW type: numeric 191, expression (tex) 69, equation 12

fgenie commented 4 months ago

Math analyses

1: According to MATH-type

In short, the type-wise analysis shows

RIMS v3 >= model-selection-baseline ~ PAL, P2C > CoT

but in aggregate,

RIMS v3 >= model-selection-baseline > PAL, P2C, CoT

RIMS prompt v3 (originated from GSM train examples)

Full Table

| MATH type | method | in-total perf. (cot;pal;p2c) | % | conflict | % | non-conflict | % |
|---|---|---|---|---|---|---|---|
| Geometry | baseline | 102 (91;79;81) / 479 | 21.3% (19.0;16.5;16.9 %) | 28 / 275 | 10.2% | 74 / 544 | 13.6% |
| | rims | 102 / 479 | 21.3% | 28 / 275 | 10.2%* | | |
| Number Theory | baseline | 279 (127;303;294) / 540 | 51.7% (23.5;56.1;54.4 %) | 10 / 208 | 4.8% | 269 / 544 | 49.4% |
| | rims | 314 / 540 | 58.1% | 45 / 208 | 21.6% | | |
| Prealgebra | baseline | 505 (369;473;463) / 871 | 58.0% (42.4;54.3;53.2 %) | 35 / 247 | 14.2% | 470 / 544 | 86.4% |
| | rims | 529 / 871 | 60.7% | 59 / 247 | 23.9% | | |
| Algebra | baseline | 585 (423;508;551) / 1187 | 49.3% (35.6;42.7;46.4 %) | 78 / 481 | 16.2% | 507 / 544 | 93.2% |
| | rims | 606 / 1187 | 51.1% | 99 / 481 | 20.6% | | |
| Counting & Probability | baseline | 169 (106;184;163) / 474 | 35.7% (22.4;38.8;34.4 %) | 13 / 223 | 5.8% | 156 / 544 | 28.7% |
| | rims | 182 / 474 | 38.4% | 26 / 223 | 11.7% | | |
| Intermediate Algebra | baseline | 137 (81;116;129) / 901 | 15.2% (9.0;12.8;14.3 %) | 27 / 517 | 5.2% | 110 / 544 | 20.2% |
| | rims | 143 / 901 | 15.9% | 33 / 516 | 6.4%* | | |
| Precalculus | baseline | 54 (35;41;45) / 544 | 9.9% (6.4;7.5;8.3 %) | 11 / 309 | 3.6% | 43 / 544 | 7.9% |
| | rims | 56 / 544 | 10.3% | 13 / 309 | 4.2%* | | |


baseline 1831/4996 (36.65%)
rims 1932/4996 (38.67%)
rims_abl 1769/4996 (35.41%)
cot 1232/4996 (24.7%)
pal 1704/4996 (34.1%)
p2c 1726/4996 (34.5%)
fgenie commented 4 months ago

2: According to MATH-level

| Level | method | conflict | non-conflict | total | cot | pal | p2c |
|---|---|---|---|---|---|---|---|
| 1 | Baseline | 14/90 (15.6%) | 281/1324 (21.2%) | 295/437 (67.5%) | 229/437 (52.4%) | 269/437 (61.6%) | 285/437 (65.2%) |
| 1 | Rims | 20/90 (22.2%) | - | 301/437 (68.9%) | - | - | - |
| 2 | Baseline | 38/292 (13.0%) | 439/1324 (33.2%) | 477/893 (53.4%) | 344/893 (38.5%) | 436/893 (48.8%) | 456/893 (51.1%) |
| 2 | Rims | 59/292 (20.2%) | - | 498/893 (55.8%) | - | - | - |
| 3 | Baseline | 42/483 (8.7%) | 425/1324 (32.1%) | 467/1128 (41.4%) | 300/1128 (26.6%) | 454/1128 (40.2%) | 438/1128 (38.8%) |
| 3 | Rims | 87/483 (18.0%) | - | 512/1128 (45.4%) | - | - | - |
| 4 | Baseline | 54/591 (9.1%) | 321/1324 (24.2%) | 375/1214 (30.9%) | 235/1214 (19.4%) | 339/1214 (27.9%) | 344/1214 (28.3%) |
| 4 | Rims | 76/591 (12.9%) | - | 397/1214 (32.7%) | - | - | - |
| 5 | Baseline | 54/804 (6.7%) | 163/1324 (12.3%) | 217/1324 (16.4%) | 124/1324 (9.4%) | 206/1324 (15.6%) | 203/1324 (15.3%) |
| 5 | Rims | 61/803 (7.6%) | - | 224/1324 (16.9%) | - | - | - |
fgenie commented 4 months ago

The composition ratio does not seem to explain it.

| | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
|---|---|---|---|---|---|
| Algebra | 135 (30.89%) | 201 (22.51%) | 261 (23.14%) | 283 (23.31%) | 307 (23.19%) |
| Counting & Probability | 39 (8.92%) | 101 (11.31%) | 100 (8.87%) | 111 (9.14%) | 123 (9.29%) |
| Geometry | 38 (8.7%) | 82 (9.18%) | 102 (9.04%) | 125 (10.3%) | 132 (9.97%) |
| Intermediate Algebra | 52 (11.9%) | 128 (14.33%) | 193 (17.11%) | 248 (20.43%) | 280 (21.15%) |
| Number Theory | 30 (6.86%) | 92 (10.3%) | 122 (10.82%) | 142 (11.7%) | 154 (11.63%) |
| Prealgebra | 86 (19.68%) | 177 (19.82%) | 224 (19.86%) | 191 (15.73%) | 193 (14.58%) |
| Precalculus | 57 (13.04%) | 112 (12.54%) | 126 (11.17%) | 114 (9.39%) | 135 (10.2%) |
fgenie commented 4 months ago

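I think the interpretation of how much the methods overlap needs rethinking. I will share it again after revising.

How unique is each method?

On MATH, I counted how many problems each method solves per problem type, and what share of those it solves uniquely. The conclusion: the higher the accuracy on a type, the lower the uniqueness.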

corrects_lvwise_MATH corrects_typewise_MATH

datapoints (uniquely correct / correct, % in parentheses)

Levels:

* Level 1: 437
    * cot_corrects: 23 / 229 (10.0)
    * pal_corrects: 15 / 269 (5.6)
    * p2c_corrects: 23 / 285 (8.1)
* Level 2: 893
    * cot_corrects: 59 / 344 (17.2)
    * pal_corrects: 60 / 436 (13.8)
    * p2c_corrects: 62 / 456 (13.6)
* Level 3: 1128
    * cot_corrects: 56 / 300 (18.7)
    * pal_corrects: 93 / 454 (20.5)
    * p2c_corrects: 71 / 438 (16.2)
* Level 4: 1214
    * cot_corrects: 75 / 235 (31.9)
    * pal_corrects: 70 / 339 (20.6)
    * p2c_corrects: 77 / 344 (22.4)
* Level 5: 1324
    * cot_corrects: 70 / 124 (56.5)
    * pal_corrects: 71 / 206 (34.5)
    * p2c_corrects: 74 / 203 (36.5)

Types:

* Algebra: 1187
    * cot_corrects (423/1187, 35.6%): 99 / 423 (23.4%)
    * pal_corrects (508/1187, 42.8%): 77 / 508 (15.2%)
    * p2c_corrects (551/1187, 46.4%): 100 / 551 (18.1%)
* Counting & Probability: 474
    * cot_corrects (106/474, 22.4%): 19 / 106 (17.9%)
    * pal_corrects (184/474, 38.8%): 38 / 184 (20.7%)
    * p2c_corrects (163/474, 34.4%): 24 / 163 (14.7%)
* Geometry: 479
    * cot_corrects (91/479, 19.0%): 37 / 91 (40.7%)
    * pal_corrects (79/479, 16.5%): 21 / 79 (26.6%)
    * p2c_corrects (81/479, 16.9%): 18 / 81 (22.2%)
* Intermediate Algebra: 901
    * cot_corrects (81/901, 9.0%): 37 / 81 (45.7%)
    * pal_corrects (116/901, 12.9%): 47 / 116 (40.5%)
    * p2c_corrects (129/901, 14.3%): 49 / 129 (38.0%)
* Number Theory: 540
    * cot_corrects (127/540, 23.5%): 16 / 127 (12.6%)
    * pal_corrects (303/540, 56.1%): 62 / 303 (20.5%)
    * p2c_corrects (294/540, 54.4%): 55 / 294 (18.7%)
* Prealgebra: 871
    * cot_corrects (369/871, 42.4%): 52 / 369 (14.1%)
    * pal_corrects (473/871, 54.3%): 47 / 473 (9.9%)
    * p2c_corrects (463/871, 53.2%): 41 / 463 (8.9%)
* Precalculus: 544
    * cot_corrects (35/544, 6.4%): 23 / 35 (65.7%)
    * pal_corrects (41/544, 7.5%): 17 / 41 (41.5%)
    * p2c_corrects (45/544, 8.3%): 20 / 45 (44.4%)
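To make the accuracy-versus-uniqueness trend easier to eyeball, the type-wise cot counts from the datapoints above can be sorted by accuracy. This is a minimal sketch; the dictionary layout and function name are illustrative, and only the cot numbers are included.

```python
# type -> (total problems, cot correct, cot uniquely correct), copied from the datapoints
COT_BY_TYPE = {
    "Algebra": (1187, 423, 99),
    "Counting & Probability": (474, 106, 19),
    "Geometry": (479, 91, 37),
    "Intermediate Algebra": (901, 81, 37),
    "Number Theory": (540, 127, 16),
    "Prealgebra": (871, 369, 52),
    "Precalculus": (544, 35, 23),
}


def accuracy_and_uniqueness(table):
    """Return (type, accuracy %, uniqueness %) rows sorted by accuracy, highest first."""
    rows = []
    for name, (total, correct, unique) in table.items():
        accuracy = 100 * correct / total      # share of all problems solved
        uniqueness = 100 * unique / correct   # share of solved problems solved by cot alone
        rows.append((name, accuracy, uniqueness))
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, acc, uniq in accuracy_and_uniqueness(COT_BY_TYPE):
    print(f"{name:<24} acc={acc:5.1f}%  unique={uniq:5.1f}%")
```

The trend is not strictly monotone across all seven types, so a rank correlation would be a more careful summary than reading off the two extremes.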

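# Interpretation & plan

So the current cot, pal, and p2c (with gsm examples) are not fully equivalent to one another.

If we treat the roughly 10% accuracy difference between methods as each method's advantage, we can at least count how often each method gets selected.

To make the differences between methods more pronounced, there are the following practical options.

  1. Differentiate each method's prompt few-shot examples (or rims blurb) into different forms instead of unifying them on gsm. For example:
    • cot - from math symbolic examples
    • pal - gsm arithmetics
    • p2c - ocw symbolics
     If this strengthens the advantage of selection, the paper will have more to say.
  2. A possible snag here is the SOTA performance story. If model selection is run in a setting different from the original, as above, we may be suspected of doing so to tune the experiment to our own method. Comparing against model selection with cot + pal, 2 methods with only gsm prompts (the same setting as Model Selection Reasoning), can ease this concern.
  3. In fact, unless we do something like this, p2c and pal are likely to be essentially similar in character (the difference being whether the docstring is written inside or outside). Replacing p2c with a method such as self-discover is another option (I have not yet looked at self-discover closely).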
fgenie commented 4 months ago

separating selection and feedback effect (chatgpt, MATH)

MATH = 4996 rows

| | selection effect | feedback_effect | in total |
|---|---|---|---|
| model_selection | 202 (4.0 %p) | 0 | 202 |
| rims | 95 (1.9 %p) | 208 (4.2 %p) | 303 (6.1 %p) |
| upperbound | 722 | - | 722 (14.5 %p) |
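The %p figures above are consistent with dividing each row count by the 4996 MATH rows. A minimal sketch of that arithmetic (the helper name `to_pp` is illustrative):

```python
TOTAL_ROWS = 4996  # MATH rows, as stated above


def to_pp(count, total=TOTAL_ROWS):
    """Convert a row count into percentage points of the full set."""
    return round(100 * count / total, 1)


gains = {
    "model_selection": {"selection": 202, "feedback": 0},
    "rims": {"selection": 95, "feedback": 208},
}

for name, g in gains.items():
    in_total = g["selection"] + g["feedback"]
    print(name, to_pp(g["selection"]), to_pp(g["feedback"]), to_pp(in_total))

print("upperbound", to_pp(722))
```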
fgenie commented 4 months ago

Feb 18

fgenie commented 4 months ago

Feb 25

fgenie commented 4 months ago

While processing the OCW prompts:

The OCW evaluation looks wrong.

is_equiv_ocw cannot parse and check even the answers that come with the provided prompt. Why --> flawed parsing and equivalence logic (surprisingly, in the authors' own code).

OCW results re-measured after changing is_equiv_ocw from normalize_final_answer + is_equiv_tex to normalize_symbolic_exp + is_equiv_exp:


27 / 272 (9.9%)


38 / 272 (14.0%)


36 / 272 (13.2%)

Mostly the same results... so on its own this says nothing about the effect of my modification to the original eval code. I should check each configuration.

fgenie commented 3 months ago

Mar 10

I tested what Taehyung and I checked yesterday on the Azure endpoint, and also tested the fixed ocw scoring function.


// eval_new correct (3)
{"answer": "x_{0} \\cos (\\omega t)+$ $\\dot{x}_{0} \\sin (\\omega t) / \\omega", "artificial_wrong": "1+x_{0} \\cos (\\omega t)+$ $\\dot{x}_{0} \\sin (\\omega t) / \\omega", "eval": true, "eval_new": false}
{"answer": "\\frac{1}{b-a}\\left(e^{-a t}-e^{-b t}\\right)", "artificial_wrong": "1+\\frac{1}{b-a}\\left(e^{-a t}-e^{-b t}\\right)", "eval": "EVAL_FAIL! cannot determine truth value of Relational", "eval_new": false}
{"answer": "m_{p} c^{2}\\left(\\gamma^{2}-1\\right) \\sin ^{2} \\theta", "artificial_wrong": "1+m_{p} c^{2}\\left(\\gamma^{2}-1\\right) \\sin ^{2} \\theta", "eval": "EVAL_FAIL! cannot determine truth value of Relational", "eval_new": false}

// both wrong (42)
{"answer": "4.5e33", "artificial_wrong": "1+4.5e33", "eval": true, "eval_new": true}
{"answer": "3.83e35", "artificial_wrong": "1+3.83e35", "eval": true, "eval_new": true}
{"answer": "8.7e8", "artificial_wrong": "1+8.7e8", "eval": true, "eval_new": true}
{"answer": "4e33", "artificial_wrong": "1+4e33", "eval": true, "eval_new": true}
{"answer": "3.3e12", "artificial_wrong": "1+3.3e12", "eval": true, "eval_new": true}
{"answer": "3e6", "artificial_wrong": "1+3e6", "eval": true, "eval_new": true}
{"answer": "7e37", "artificial_wrong": "1+7e37", "eval": true, "eval_new": true}
{"answer": "7.5e7", "artificial_wrong": "1+7.5e7", "eval": true, "eval_new": true}
{"answer": "2e27", "artificial_wrong": "1+2e27", "eval": true, "eval_new": true}
{"answer": "2.75e11", "artificial_wrong": "1+2.75e11", "eval": true, "eval_new": true}
{"answer": "6e13", "artificial_wrong": "1+6e13", "eval": true, "eval_new": true}
{"answer": "4.4e7", "artificial_wrong": "1+4.4e7", "eval": true, "eval_new": true}
{"answer": "3e8", "artificial_wrong": "1+3e8", "eval": true, "eval_new": true}
{"answer": "1e11", "artificial_wrong": "1+1e11", "eval": true, "eval_new": true}
{"answer": "400000", "artificial_wrong": "1+400000", "eval": true, "eval_new": true}
{"answer": "5.47e5", "artificial_wrong": "1+5.47e5", "eval": true, "eval_new": true}
{"answer": "2.19e6", "artificial_wrong": "1+2.19e6", "eval": true, "eval_new": true}
{"answer": "1.87e6", "artificial_wrong": "1+1.87e6", "eval": true, "eval_new": true}
{"answer": "4.45e15", "artificial_wrong": "1+4.45e15", "eval": true, "eval_new": true}
{"answer": "9e11", "artificial_wrong": "1+9e11", "eval": true, "eval_new": true}
{"answer": "7.353e14", "artificial_wrong": "1+7.353e14", "eval": true, "eval_new": true}
{"answer": "1.39e9", "artificial_wrong": "1+1.39e9", "eval": true, "eval_new": true}
{"answer": "9.35e5", "artificial_wrong": "1+9.35e5", "eval": true, "eval_new": true}
{"answer": "2.88e16", "artificial_wrong": "1+2.88e16", "eval": true, "eval_new": true}
{"answer": "7.26e6", "artificial_wrong": "1+7.26e6", "eval": true, "eval_new": true}
{"answer": "1.85e5", "artificial_wrong": "1+1.85e5", "eval": true, "eval_new": true}
{"answer": "4.46e19", "artificial_wrong": "1+4.46e19", "eval": true, "eval_new": true}
{"answer": "3.21e13", "artificial_wrong": "1+3.21e13", "eval": true, "eval_new": true}
{"answer": "2.45e6", "artificial_wrong": "1+2.45e6", "eval": true, "eval_new": true}
{"answer": "7.02e5", "artificial_wrong": "1+7.02e5", "eval": true, "eval_new": true}
{"answer": "3.75e9", "artificial_wrong": "1+3.75e9", "eval": true, "eval_new": true}
{"answer": "1.07e16", "artificial_wrong": "1+1.07e16", "eval": true, "eval_new": true}

others the same (272-42-3)
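One possible reading of the "both wrong" rows: every answer there is a large magnitude, so appending `1+` may be invisible to a numeric comparison. The sketch below shows that effect with plain Python floats. Whether the actual checker compares via floats, or with a relative tolerance, is an assumption here, and `REL_TOL` is a hypothetical value rather than one taken from the eval code.

```python
import math

REL_TOL = 1e-4  # hypothetical tolerance, not taken from the eval code


def plus_one_is_absorbed(answer: str) -> bool:
    """True if float(answer) + 1 is exactly equal to float(answer)."""
    x = float(answer)
    return x + 1 == x


def plus_one_within_tolerance(answer: str, rel_tol: float = REL_TOL) -> bool:
    """True if float(answer) + 1 matches float(answer) under a relative tolerance."""
    x = float(answer)
    return math.isclose(x + 1, x, rel_tol=rel_tol)


for ans in ["4.5e33", "3e6", "400000"]:
    print(ans, plus_one_is_absorbed(ans), plus_one_within_tolerance(ans))
```

If this is what happens, a probe that scales the answer (for example, doubling it) would be a stricter artificial-wrong case than adding 1.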
fgenie commented 3 months ago

CoT result handling fix

The ocw parsing / evaluation problem was resolved by the following merge. https://github.com/fgenie/rims_minimal/pull/40#issue-2186864593


| Metric | Math | OCW |
|---|---|---|
| old_acc | 0.247 | 0.099 |
| new_acc | 0.274 (+2.7%p) | 0.195 (+9.6%p) |
| delta correct | (+138 / 4996) | (+26 / 272) |
fgenie commented 3 months ago

PAL/P2C (code) execution result handling fix

GSM: (no net change) / 6 None's
OCW: (24 net change) / 82 rows change over 272 / 58 None's
MATH: (147 net change) / 1318 rows change over 4996 / 1172 None's

fgenie commented 3 months ago

Progress is tracked here... (working branch todolist). Only a few checkboxes remain now.

fgenie commented 3 months ago

Mar 17

fgenie commented 3 months ago

Changes to earlier results caused by the updated grading code

The previous main results below (used gsm prompt v3), re-graded with the changed grading code: https://github.com/fgenie/rims_minimal/issues/35#issue-2123202439

| eval fix effect only | model_selection | rims |
|---|---|---|
| gsm | 1087/1319 (-) | 1114/1319 (-) |
| ocw | 36/272 (-) | 39/272 (-) |
| math | 1832/4996 (+1) | 1936/4996 (+4) |

be cautious!

fgenie commented 3 months ago

OVERLAPS + Individual performance (chatgpt0613long)

prompt = GSM_OLD_BEST (only gsm fewshots)

dataset = gsm (total 1319)
{'all': 829,
'cot_only': 62,
'p2c_only': 32,
'pal_only': 67,
'cotpal-p2c': 85,
'p2ccot-pal': 54,
'palp2c-cot': 73}

single perf = 
'cot': 1030,
'p2c': 988,
'pal': 1054,

dataset = math (total 5000)
{'all': 657,
'cot_only': 283,
'p2c_only': 322,
'pal_only': 317,
'p2ccot-pal': 158,
'cotpal-p2c': 134,
'palp2c-cot': 610}

'cot': 1232,
'pal': 1718,
'p2c': 1747,

dataset = ocw (total 272)

{'all': 6,
'cot_only': 13,
'pal_only': 14,
'p2c_only': 10,
'cotpal-p2c': 3,
'p2ccot-pal': 5,
'palp2c-cot': 15}

'cot': 27,
'pal': 38,
'p2c': 36,
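The single-method counts can be rebuilt from the seven overlap regions. The sketch below assumes a key such as `'cotpal-p2c'` means "correct for cot and pal but not p2c"; that reading is an inference, though it does reproduce the reported gsm single perf numbers. The function name is illustrative.

```python
def single_method_counts(regions):
    """Rebuild per-method correct counts from the seven overlap regions.

    Assumed key convention: 'cotpal-p2c' = correct for cot and pal, wrong for p2c.
    """
    cot = regions["all"] + regions["cot_only"] + regions["cotpal-p2c"] + regions["p2ccot-pal"]
    pal = regions["all"] + regions["pal_only"] + regions["cotpal-p2c"] + regions["palp2c-cot"]
    p2c = regions["all"] + regions["p2c_only"] + regions["p2ccot-pal"] + regions["palp2c-cot"]
    solved_by_any = sum(regions.values())  # the regions are disjoint, so the union is a sum
    return {"cot": cot, "pal": pal, "p2c": p2c, "any": solved_by_any}


gsm_regions = {
    "all": 829,
    "cot_only": 62,
    "p2c_only": 32,
    "pal_only": 67,
    "cotpal-p2c": 85,
    "p2ccot-pal": 54,
    "palp2c-cot": 73,
}
print(single_method_counts(gsm_regions))
```

The `any` count is the pool of problems that a perfect selector could get right, so it bounds what any selection method can reach on that dataset.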
fgenie commented 3 months ago

Mar 24


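Preview: the math results I promised to bring are below. I will post the remaining results as they are compiled. I plan to organize them from the following angles:

  1. What changed compared with the previous results (this can be previewed in the math-baseline result below)
  2. How differently each method (cot, pal, p2c) behaves in terms of final answers
  3. Whether rims > baseline
  4. Whether rims > ablations (-hint / -hint-mistakes / -hint-mistakes-1st attempt)

I found and fixed a small bug in gsm CoT; gsm CoT now scores properly. Grading answers written in tex seems to have a very low success rate. That was already true in the original minerva code, and although it has improved a little after the fix, it is still the case.

Additionally, I confirmed that rims prompting does not need a long context beyond 4k. Last time max_token was set too high during the experiments, which forced the use of the 16k model, but input + output fits within the 4k context. For comparison with the previous results, however, this experiment uses the same gpt-3.5-turbo-0613-16k. Later experiments need not do so.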

math baseline result
model = chatgpt0613long
temperature = 0

individual performance
cot: 1527 / 4999 (30.5%) (former: 24.7%)
pal: 2047 / 4999 (40.9%) (former: 34.1%)
p2c: 1758 / 4999 (35.2%) (former: 34.5%)

overall performance (model-selection-reasoning)
overall_acc: 2120 / 4999 (42.4%)
success_rate: 354 / 2438 (14.5%)
4999 (total), of which 2438 (selection)

former = the last result obtained with GSM few-shots. To compare against it, the following conditions were held fixed:

  • model = chatgpt 0613 long
  • temperature

cot, pal, p2c performance and the model selection baseline each rose by roughly 5%p. The factors that contributed are:

  • few-shots for cot, pal: (gsm-8shots --> minerva-math-4shots)
  • few-shots for p2c: (gsm-8shots --> MBPP 8 shots = the prompt from the original plan2code paper). The prompt format also changed from before, so the planning step is sometimes not explicit in the LLM-generated answer; however, this matches what the original p2c paper does.
  • the math parsing function (affects cot results)
  • the math evaluation function (affects the final score and also the majority voting step)
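The overall_acc and success_rate reported above can be related by a short calculation. The sketch assumes success_rate is measured only on the rows that go through selection, so the remaining correct answers come from rows where no selection was needed. That interpretation, and the function name, are assumptions for illustration.

```python
def split_scores(overall_correct, selection_correct, total, selection_total):
    """Relate overall accuracy to the selection success rate.

    Assumes selection_correct counts correct answers among the rows that
    required selection, and that all other correct answers need no selection.
    """
    return {
        "overall_acc": 100 * overall_correct / total,
        "success_rate": 100 * selection_correct / selection_total,
        "non_selection_correct": overall_correct - selection_correct,
    }


scores = split_scores(overall_correct=2120, selection_correct=354,
                      total=4999, selection_total=2438)
print(scores)
```

Under this reading, two prompts evaluated on the same selection rows differ in overall accuracy only through their selection successes, which is why the success rate is the more sensitive number to compare.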

fgenie commented 3 months ago

mar 27

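On gsm (+4.5%p, +59/1319) and math (+2.3%p, +115/5000), rims > simple greedy holds, but on ocw (-1.4%p, -4/272) we get rims ~< simple greedy. If we try a few more prompts, at least one should come out fine, since the ocw sample is small.

*For math, only one of the two prompts satisfies the above; for gsm, only one was tried and it worked; for ocw, only two have been tried so far.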

fgenie commented 3 months ago

mar 30 (updated: apr 1)

OVERLAPs between methods

chatgpt 1106

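| Distinct | GSM | OCW | Math |
|------------|----------------|----------------|----------------|
| cot_only | 57 (4.77%) | 31 (39.74%) | 536 (17.81%) |
| pal_only | 53 (4.44%) | 5 (6.41%) | 445 (14.79%) |
| p2c_only | 44 (3.69%) | 19 (24.36%) | 299 (9.94%) |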
chatgpt0613 overlaps

# OVERLAPS

* model: `chatgpt0613long`
* gsm overlap does not change much
* the possible pool in ocw increases by 9.5% of the total (+26/272); cot_only (12->28) and p2c_only (5->17) increase significantly, while pal changes little
* the possible pool in math also increases by 9.3% (+466/5000); cot_only (283->481) and pal_only (317->485) increase significantly, while p2c_only (322->265) decreases significantly

## after applying fewshot change

* cot: gsm-8-shot / ocw-4-shot / math-4-shot
* pal: gsm-8-shot / ocw-4-shot / math-4-shot
* p2c: mbpp-8-shot

| Distinct | GSM | OCW | Math |
|------------|----------------|----------------|----------------|
| cot_only | 55 (4.60%) | 28 (30.77%) | 481 (16.32%) |
| pal_only | 83 (6.94%) | 12 (13.19%) | 485 (16.46%) |
| p2c_only | 33 (2.76%) | 17 (18.68%) | 265 (8.99%) |
| **cot or pal or p2c** | **1196** | **91** | **2947** |

## previous result (before prompt-few shot diversifying)

* cot: gsm-8-shot
* pal: gsm-8-shot
* p2c: gsm-8-shot

| Distinct | GSM | OCW | Math |
|------------|----------------|----------------|----------------|
| cot_only | 62 (5.16%) | 12 (18.46%) | 283 (11.41%) |
| pal_only | 67 (5.57%) | 15 (23.08%) | 317 (12.78%) |
| p2c_only | 32 (2.66%) | 5 (7.69%) | 322 (12.98%) |
| **cot or pal or p2c** | **1202** | **65** | **2481** |
fgenie commented 3 months ago

mar 30 (1)


ocw 1106

prompt Overall Accuracy Success Rate
simple greedy 44/272 (16.2%) 11/187 (5.9%)
simple greedy + SC@5 (cotT0.5, palT0.8) 57 / 272 44 / 222
----------------------------------------------------------------- ------------------ --------------
rims_gsm_old 55 / 272 (20.2%) 14 / 187 (7.5%)
----------------------------------------------------------------- ------------------ --------------
rims (p2c-cot.pal-p2c.cot-p2c) 63 / 272 (23.2%) 22 / 155 (14.2%)
SC@5 T=0.7 60 / 272 47 / 222 (fail=1)
SC@5 T=0.5 73 / 264 60 / 222
SC@5 T=0.2 66 / 264 53 / 222
SC@10 T=0.5 82 / 249 75 / 227
SC@10 T=0.2 85 / 249 78 / 227
-hint 61 / 272 (22.4%) 20 / 155 (12.9%)
-hint-mistakes 60 / 272 (22.1%) 19 / 155 (12.3%)
-hint-mistakes-attempt1 54 / 272 (19.9%) 13 / 155 (8.4%)
----------------------------------------------------------------- ------------------ --------------
rims' (p2c-cot.pal-p2c.pal-cot) 57 / 272 (21.0%) 16 / 187 (8.6%)
SC@5 T=0.7 56 / 272 43 / 222 (19.4%)
SC@5 T=0.5 63 / 264 50 / 222
SC@5 T=0.2 65 / 264 52 / 222
SC@10 T=0.5 79 / 249 72 / 227
SC@10 T=0.2 77 / 249 70 / 227 (1 fail)
-hint 52 / 272 (19.1%) 11 / 187 (5.9%)
-hint-mistakes 56 / 272 (20.6%) 15 / 187 (8.0%)
-hint-mistakes-attempt1 51 / 272 (18.8%) 10 / 187 (5.3%)
----------------------------------------------------------------- ------------------ --------------
cot 49 / 272 (18.0%) -
pal 17 / 272 (6.2%) -
p2c 37 / 272 (13.6%) -

gsm 1106

prompt Overall Accuracy Success Rate
simple greedy 1081 / 1319 (82.0%) 43 / 196 (21.9%)
SC@15 1126 / 1297 ( 22 api errors ) 261 / 413 (63.2%)
----------------------------------------------------------------- ------------------ --------------
rims_gsm_old 1127 / 1319 (85.4%) 86 / 193 (44.6%)
----------------------------------------------------------------- ------------------ --------------
rims 1122 / 1319 (85.1%) 81 / 193 (42.0%)
-hint 1131 / 1319 (85.7%) 90 / 193 (46.6%)
-hint-mistakes 1122 / 1319 (85.1%) 81 / 193 (42.0%)
-hint-mistakes-attempt1 1103 / 1319 (83.6%) 62 / 193 (32.1%)
+p2c_rewrote (GSM_RIMS) 1127 / 1319 (85.4%) 86 / 193 (44.6%)
SC@15 T=0.2 1151 / 1288 (+9 fails) 286 / 404 (70.8%)
SC@15 T=0.5 1153 / 1285 (+12 fails) 288 / 401 (71.8%)
----------------------------------------------------------------- ------------------ --------------
rims'+p2c_rewrote (cot2p2c.pal2cot.pal2p2c) (GSM_RIMS1) 1103 / 1319 (83.6%) 62 / 193 (32.1%)
SC@15 T=0.2 1150 / 1296 (+1 fails) 285 / 412 (69.2%)
SC@15 T=0.5 1155 / 1285 (+12 fails) 290 / 401 (72.3%)
rims''+p2c_rewrote (pal2p2c.cot2p2c.cot2pal) (GSM_RIMS2) 1113 / 1319 (84.4%) 72 / 193 (37.3%)
SC@15 T=0.2 1143 / 1292 (+5 fails) 278 / 408 (68.1%)
SC@15 T=0.5 1143 / 1292 278 / 408 (68.1%)
----------------------------------------------------------------- ------------------ --------------
cot 921 / 1319 (69.8%) -
pal 1038 / 1319 (78.7%) -
p2c 991 / 1319 (75.1%) -
chatgpt0613long results (fails on ocw)

# chatgpt0613long

* TL;DR: 2/2 fail on ocw, 1/2 fails on math, 1/1 succeeds on gsm

## ocw

* both rims variants lose (-4, -5 / 272)

| Component | Result | Success Rate |
|-----------|---------------------|---------------|
| cot | 52 / 272 (19.1%) | |
| pal | 35 / 272 (12.9%) | |
| p2c | 44 / 272 (16.2%) | |
| simple greedy | 61 / 272 (22.4%) | |
| rims_gsm_best | 55 / 272 (20.2%) | |
| rims (p2c-cot.pal-p2c.pal-cot) | 57 / 272 (21.0%) | |
| -hint | 52 / 272 (19.1%) | |
| -hint-mistakes | 56 / 272 (20.6%) | |
| -hint-mistakes-attempt1 | 51 / 272 (18.8%) | |
| rims_ocw_p2c-cot.pal-p2c.cot-p2c | 56 / 272 (20.6%) | 15 / 187 (8.0%) |
| -hint | 57 / 272 (21.0%) | 16 / 187 (8.6%) |
| -hint-mistakes | 56 / 272 (20.6%) | 15 / 187 (8.0%) |
| -hint-mistakes-attempt1 | 55 / 272 (20.2%) | 14 / 187 (7.5%) |

## math 0613

* one wins (+115/5000), one almost loses (-10/5000)

| prompt | Overall Accuracy | Success Rate |
|-------------------------------------------|------------------|---------------|
| simple greedy | 2120 / 4999 (42.4%) | 354 / 2438 (14.5%) |
| rims_gsm_best | 2101 / 4999 (42.0%) | 335 / 2438 (13.7%) |
| rims (p2c-cot.pal-p2c.pal-cot) | 2110 / 4999 (42.2%) | 344 / 2438 (14.1%) |
| -hint | 2115 / 4999 (42.3%) | 349 / 2438 (14.3%) |
| -hint-mistakes | 2104 / 4999 (42.1%) | 338 / 2438 (13.9%) |
| -hint-mistakes-attempt1 | 2124 / 4999 (42.5%) | 358 / 2438 (14.7%) |
| rims (1) | 2235 / 4999 (44.7%) | 469 / 2438 (19.2%) |
| -hint | 2235 / 4999 (44.7%) | 469 / 2438 (19.2%) |
| -hint-mistakes | 2219 / 4999 (44.4%) | 453 / 2438 (18.6%) |
| -hint-mistakes-attempt1 | 2132 / 4999 (42.6%) | 366 / 2438 (15.0%) |
| cot | 1527 / 4999 (30.5%) | - |
| pal | 2047 / 4999 (40.9%) | - |
| p2c | 1758 / 4999 (35.2%) | - |

## gsm 0613

| prompt | Overall Accuracy | Success Rate |
|----------------------------------------------|-------------------|---------------|
| simple greedy | 1067 / 1319 (80.9%) | 43 / 195 (22.1%) |
| rims_gsm_old | 1112 / 1319 (84.3%) | 85 / 192 (44.3%) |
| rims_gsm_new | 1126 / 1319 (85.4%) | 99 / 192 (51.6%) |
| -hint | 1102 / 1319 (83.5%) | 75 / 192 (39.1%) |
| -hint-mistakes | 1113 / 1319 (84.4%) | 86 / 192 (44.8%) |
| -hint-mistakes-attempt1 | 1077 / 1319 (81.7%) | 50 / 192 (26.0%) |
| cot | 942 / 1319 (71.4%) | - |
| pal | 1056 / 1319 (80.1%) | - |
| p2c | 955 / 1319 (72.4%) | - |
fgenie commented 3 months ago

apr 1

chatgpt1106 - math

math 1106

prompt Overall Accuracy Success Rate
simple greedy 2086 / 4999 (41.7%) 361 / 2550 (14.2%)
----------------------------------------------------------------- ------------------ --------------
rims_gsm_old 2192 / 4999 (43.8%) 392 / 2361 (16.6%)
----------------------------------------------------------------- ------------------ --------------
rims (p2c-cot.pal-p2c.pal-cot) 2188 / 4999 (43.8%) 388 / 2361 (16.4%)
-hint 2218 / 4999 (44.4%) 418 / 2361 (17.7%)
-hint-mistakes 2170 / 4999 (43.4%) 416 / 2503 (16.6%)
-hint-mistakes-attempt1 2151 / 4999 (43.0%) 351 / 2361 (14.9%)
----------------------------------------------------------------- ------------------ --------------
rims (1) 2191 / 4999 (43.8%) 391 / 2361 (16.6%)
-hint 2166 / 4999 (43.3%) 366 / 2361 (15.5%)
-hint-mistakes 2177 / 4999 (43.5%) 377 / 2361 (16.0%)
-hint-mistakes-attempt1 2137 / 4999 (42.7%) 382 / 2500 (15.3%)
----------------------------------------------------------------- ------------------ --------------
cot 1644 / 4999 (32.9%)
pal 1900 / 4999 (38.0%)
p2c 1796 / 4999 (35.9%)
fgenie commented 2 months ago

Apr 7

GPT-4-1106-preview results

distinctively effective methods

cot_only pal_only p2c_only
math (5000) 557 104 977
ocw (272) 28 5 42
gsm (1319) 15 13 10

rims vs simple_greedy vs (cot/pal/p2c)


satisfactory result

prompt Overall Accuracy Success Rate
(selection max: 38/41 (92.7%))
simple greedy 1249 / 1319 (94.7%) 13 / 41 (31.7%)
rims_gsm_old 1262 / 1319 (95.7%) 23 / 41 (56.1%)
rims_gsm_newer (remove p2c plan from above) 1259 / 1319 (95.5%) 20 / 41 (48.8%)
rims* (p2c2cot.pal2p2c.pal2cot) 1260 / 1319 (95.5%) 21 / 41 (51.2%)
rims* (pal2p2c.cot2p2c.cot2pal) 1256 / 1319 (95.2%) 17 / 41 (41.5%)
rims* (cot2p2c.pal2cot.pal2p2c) 1259 / 1319 (95.5%) 20 / 41 (48.8%)
cot 1110 / 1319 (84.2%)
pal 1239 / 1319 (93.9%)
p2c 1226 / 1319 (92.9%)

*those are for unifying reformatted p2c format of MATH and ocw_courses


satisfactory result

prompt Overall Accuracy Success Rate
(selection max: 1638/2539 (64.5%))
simple greedy 2126 / 4999 (42.5%) 401 / 2539 (15.8%)
rims_gsm_old 2539 / 4999 (50.8%) 814 / 2539 (32.1%)
rims (p2c-cot.pal-p2c.pal-cot) 2584 / 4999 (51.7%) 859 / 2539 (33.8%)
rims (p2c-cot.pal-p2c.pal-cot) (1) 2597 / 4999 (52.0%) 872 / 2539 (34.3%)
cot 1828 / 4999 (36.6%)
pal 741 / 4999 (14.8%)
p2c 2468 / 4999 (49.4%)

*(1) has different question in fewshot blurb



prompt Overall Accuracy Success Rate
(selection max: 85/157 (54.1%))
simple greedy 69 / 272 (25.4%) 16 / 157 (10.2%)
rims_gsm_old 79 / 272 (29.0%) 26 / 157 (16.6%)
rims (p2c-cot.pal-p2c.pal-cot) 74 / 272 (27.2%) 21 / 157 (13.4%)
rims (p2c-cot.pal-p2c.cot-p2c) 67 / 272 (24.6%) 14 / 157 (8.9%)
cot 61 / 272 (22.4%)
pal 23 / 272 (8.5%)
p2c 78 / 272 (28.7%)
fgenie commented 2 months ago

prompts differences in GSM

The following applies to all prompts in the gsm experiments: rims and simple-greedy prompts (the last one is the one finally used). For MATH and OCW, p2c plans sometimes appear implicitly and sometimes explicitly (even though the prompts that generated them were all explicit!)

# p2c in gsm_old
{CODE} # code 

# p2c in gsm_newer (that is, "remove plan" above in gpt4 table)
def solution():
    """ docstring usually dropped the plan given in the prompt """
    {CODE} # but the code includes kind of numbered comments.

# p2c in rims* above in gpt4 table
def solution():
    questions and some explanations

For more, see the prompts below:

fgenie commented 2 months ago

reflection vs selection effect



reflection_effect selection_effect select_p2c select_pal select_cot
simple greedy 0.0 % 100.0 % 29.4 % 8.0 % 62.6 %
rims_gsm_old (p2c2cot.pal2p2c.pal2cot) 90.5 % 9.5 % 0.6 % 0.1 % 8.7 %
rims* (p2c-cot.pal-p2c.pal-cot) (1) 85.1 % 14.9 % 5.2 % 0.2 % 9.5 %
rims* (p2c-cot.pal-p2c.pal-cot) 85.6 % 14.4 % 5.6 % 0.1 % 8.7 %


reflection_effect selection_effect select_p2c select_pal select_cot
simple greedy 0.0% 100.0% 7.7% 38.5% 53.8%
rims_gsm_old (p2c2cot.pal2p2c.pal2cot) 60.9% 39.1% 0.0% 0.0% 39.1%
rims_gsm_old (remove plan) 65.0% 35.0% 0.0% 0.0% 35.0%
rims* (p2c2cot.pal2p2c.pal2cot) 66.7% 33.3% 0.0% 0.0% 33.3%
rims* (pal2p2c.cot2p2c.cot2pal) 64.7% 35.3% 0.0% 0.0% 35.3%
rims* (cot2p2c.pal2cot.pal2p2c) 55.0% 45.0% 10.0% 0.0% 35.0%


reflection_effect selection_effect select_p2c select_pal select_cot
simple greedy 0.0% 100.0% 18.8% 0.0% 81.2%
rims_gsm_old (p2c2cot.pal2p2c.pal2cot) 96.2% 3.8% 0.0% 0.0% 3.8%
rims* (p2c2cot.pal2p2c.pal2cot) 90.5% 9.5% 9.5% 0.0% 0.0%
rims* (p2c2cot.pal2p2c.cot2p2c) 71.4% 28.6% 28.6% 0.0% 0.0%



reflection_effect selection_effect select_p2c select_pal select_cot
simple greedy 0.0 % 100.0 % 3.0 % 13.3 % 83.7 %
rims_gsm_old 67.1 % 32.9 % 5.6 % 17.6 % 9.7 %
rims* p2c-cot.pal-p2c.pal-cot 75.5 % 24.5 % 7.0 % 2.8 % 14.7 %
rims* p2c-cot.pal-p2c.pal-cot (1) 75.2 % 24.8 % 7.9 % 5.4 % 11.5 %
rims* p2c-cot.pal-p2c.pal-cot-hint 75.6 % 24.4 % 6.9 % 2.4 % 15.1 %
rims* p2c-cot.pal-p2c.pal-cot-hint (1) 77.6 % 22.4 % 8.5 % 0.8 % 13.1 %
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes 27.2 % 72.8 % 5.5 % 9.1 % 58.2 %
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes (1) 70.6 % 29.4 % 7.2 % 5.8 % 16.4 %
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes-attempt1 70.7 % 29.3 % 6.6 % 14.0 % 8.8 %
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes-attempt1 (1) 30.1 % 69.9 % 5.0 % 12.3 % 52.6 %


reflection_effect selection_effect select_p2c select_pal select_cot
simple greedy 0.0 % 100.0 % 2.3 % 2.3 % 95.3 %
rims_gsm_old 79.1 % 20.9 % 3.5 % 1.2 % 16.3 %
rims_gsm_newer 69.1 % 30.9 % 1.2 % 2.5 % 27.2 %
rims_gsm_newer-hint 73.3 % 26.7 % 0.0 % 0.0 % 26.7 %
rims_gsm_newer-hint-mistakes 74.1 % 25.9 % 0.0 % 2.5 % 23.5 %
rims_gsm_newer-hint-mistakes-attempt1 62.9 % 37.1 % 1.6 % 32.3 % 3.2 %
rims* p2c2cot.pal2p2c.pal2cot 69.8 % 30.2 % 9.3 % 2.3 % 18.6 %
rims* pal2p2c.cot2p2c.cot2pal 82.3 % 17.7 % 3.2 % 12.9 % 1.6 %
rims* cot2p2c.pal2cot.pal2p2c 70.8 % 29.2 % 8.3 % 2.8 % 18.1 %


reflection_effect selection_effect select_p2c select_pal select_cot
simple greedy 0.0 % 100.0 % 18.2 % 0.0 % 81.8 %
rims_gsm_old 75.0 % 25.0 % 12.5 % 6.2 % 6.2 %
rims* p2c-cot.pal-p2c.cot-p2c 59.1 % 40.9 % 31.8 % 0.0 % 9.1 %
rims* p2c-cot.pal-p2c.pal-cot 73.3 % 26.7 % 6.7 % 0.0 % 20.0 %
rims* p2c-cot.pal-p2c.cot-p2c-hint 80.0 % 20.0 % 5.0 % 0.0 % 15.0 %
rims* p2c-cot.pal-p2c.pal-cot-hint 80.0 % 20.0 % 0.0 % 0.0 % 20.0 %
rims* p2c-cot.pal-p2c.cot-p2c-hint-mistakes 73.7 % 26.3 % 15.8 % 0.0 % 10.5 %
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes 86.7 % 13.3 % 6.7 % 0.0 % 6.7 %
rims* p2c-cot.pal-p2c.cot-p2c-hint-mistakes-attempt1 100.0 % 0.0 % 0.0 % 0.0 % 0.0 %
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes-attempt1 70.0 % 30.0 % 20.0 % 0.0 % 10.0 %
fgenie commented 2 months ago

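(Seungjae) It would be good to work out the cost side of rims carefully.
(chul) It would be good to apply self-consistency to (2) and compare its performance against simple-greedy + SC (1). (chatgpt)
(chul) Run the open-source LLM experiments here if at all possible? (Gichang has no bandwidth.)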

fgenie commented 2 months ago

Apr 13: taking a break

Apr 20

fgenie commented 2 months ago

Apr 29: SC results only table

(will update further after math done on Tue)

Self-consistency / Temperature




note that rims (blabla) is just for example combination + orderings.

prompt Overall Accuracy Success Rate
simple greedy 44/272 (16.2%) 11/187 (5.9%)
SC@5 (cotT0.5, palT0.8) 57 / 272 44 / 222
SC@10 69 / 249 (23 failed) 62 / 227
----------------------------------------------------------------- ------------------ --------------
rims (p2c-cot.pal-p2c.cot-p2c) 63 / 272 (23.2%) 22 / 155 (14.2%)
SC@5 T=0.2 66 / 264 53 / 222
SC@5 T=0.5 73 / 264 60 / 222
SC@5 T=0.7 60 / 272 47 / 222 (fail=1)
SC@10 T=0.2 85 / 249 78 / 227
SC@10 T=0.5 82 / 249 75 / 227
----------------------------------------------------------------- ------------------ --------------
rims (p2c-cot.pal-p2c.pal-cot) 57 / 272 (21.0%) 16 / 187 (8.6%)
SC@5 T=0.2 65 / 264 52 / 222
SC@5 T=0.5 63 / 264 50 / 222
SC@5 T=0.7 56 / 272 43 / 222 (19.4%)
SC@10 T=0.2 77 / 249 70 / 227 (1 fail)
SC@10 T=0.5 79 / 249 72 / 227


gpt-3.5-turbo-1106 prompt Overall Accuracy Success Rate
simple greedy 1081 / 1319 (82.0%) 43 / 196 (21.9%)
SC@15 1126 / 1297 ( 22 api errors ) 261 / 413 (63.2%)
----------------------------------------------------------------- ------------------ --------------
rims (newer_best_p2c2cot.pal2p2c.pal2cot) 1122 / 1319 (85.1%) 81 / 193 (42.0%)
SC@15 T=0.2 1151 / 1288 (+9 fails) 286 / 404 (70.8%)
SC@15 T=0.5 1153 / 1285 (+12 fails) 288 / 401 (71.8%)
----------------------------------------------------------------- ------------------ --------------
rims (cot2p2c.pal2cot.pal2p2c) (GSM_RIMS1) 1103 / 1319 (83.6%) 62 / 193 (32.1%)
SC@15 T=0.2 1150 / 1296 (+1 fails) 285 / 412 (69.2%)
SC@15 T=0.5 1155 / 1285 (+12 fails) 290 / 401 (72.3%)
rims (pal2p2c.cot2p2c.cot2pal) (GSM_RIMS2) 1113 / 1319 (84.4%) 72 / 193 (37.3%)
SC@15 T=0.2 1143 / 1292 (+5 fails) 278 / 408 (68.1%)
SC@15 T=0.5 1143 / 1292 278 / 408 (68.1%)
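The SC@k rows aggregate k sampled answers per question. Self-consistency is commonly implemented as a majority vote over those samples. The sketch below shows that idea only; it assumes answers are already normalized to comparable strings and that a failed generation or parse is represented as `None`. Both are assumptions, and the repository's actual aggregation may differ.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-None answer, or None if every sample failed.

    Ties are broken in favor of the answer that appeared first.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Toy samples for illustration only (not experimental data).
print(majority_vote(["12", "12", "13", None, "12"]))
print(majority_vote([None, None, None]))
```

Voting on raw strings splits the vote between equivalent forms such as `0.5` and `1/2`, so answers should be normalized before counting. This matters most for tex answers.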
fgenie commented 2 months ago

Apr 30

Something's going super wrong (SC@5 << T=0 n=1)

math 1106

prompt Overall Accuracy Success Rate
simple greedy 2086 / 4999 (41.7%) 361 / 2550 (14.2%)
SC@5, cotT=0.5 / palT=0.8 439 / 4904 (9.0%) 318 / 3684 (8.6%)
----------------------------------------------------------------- ------------------ --------------
rims (p2c-cot.pal-p2c.pal-cot) 2188 / 4999 (43.8%) 388 / 2361 (16.4%)
SC@5, T=0.2 536 / 4897 (10.9%) 415 / 3677 (11.3%)
SC@5, T=0.5
----------------------------------------------------------------- ------------------ --------------
rims (1) 2191 / 4999 (43.8%) 391 / 2361 (16.6%)
SC@5, T=0.2
SC@5, T=0.5
fgenie commented 1 month ago

May: panic!

fgenie commented 1 month ago

Jun 2

@strutive07 @fgenie chul

fgenie commented 1 week ago

Jun 23

refactored gsm8k results

Phi3 small


Llama3 8B


chul: use the two above as the open LLMs

new model

Get a cost estimate for Claude-3.5-sonnet (https://www.computerworld.com/article/2472913/anthropic-claude-3-5-sonnet-is-here-and-its-free.html)

Implement in parallel
