The current rims prompt is a method that combines the two below.
What could we examine to bring these out?
As additional analyses:
It would be ideal if rims tended to do well in situations where only one particular method can solve the problem,
but cot and pal actually look applicable in almost any case. Should we design the experiment assuming that certain problem types are impossible for some methods?
Both the RIMS and the model-selection-reasoning baseline prompts built from GSM examples utterly fail on symbolic reasoning (= v3 prompt, currently the best-performing prompt): prompt_construction_src/prep_rims_prompts/gsm_prompts/3_reflectonce_p2c2cot.pal2p2c.pal2cot.txt_rm_ans
chatgpt0613long | numeric | symbolic |
---|---|---|
non-conflict | 30/82 | 0/28 |
conflict (baseline) | 6/109 | 0/53 |
conflict (rims) | 9/108 | 0/52 |
*(The 1 missing numeric and 1 missing symbolic in the rims results failed to parse.)
gpt4turbo | numeric | symbolic |
---|---|---|
non-conflict | 52/105 | 1/39 |
conflict (baseline) | 7/72 | 0/36 |
conflict (rims) | 10/76 | 0/41 |
(14 numerics and 6 symbolics missing in baseline; the 20 missing entries failed to parse.) (10 numerics and 1 symbolic missing in rims; the 11 missing entries failed to parse.)
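As a reading aid, here is a minimal sketch of how the non-conflict / conflict split in these tables is presumed to work; the exact agreement criterion is an assumption (non-conflict meaning at least two methods agree on the parsed answer), not something confirmed by the repo.

```python
# Minimal sketch (assumption): "non-conflict" = at least two of cot/pal/p2c
# return the same parsed answer, which majority vote then takes; otherwise the
# question is routed to the selection stage (model-selection baseline or RIMS).
from collections import Counter

def route_question(cot_ans: str, pal_ans: str, p2c_ans: str):
    answer, count = Counter([cot_ans, pal_ans, p2c_ans]).most_common(1)[0]
    if count >= 2:
        return "non-conflict", answer   # majority vote decides
    return "conflict", None             # defer to baseline selection / rims
```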
OCW answer types: numeric 191, expression (tex) 69, equation 12
In short, the type-wise analysis shows
RIMS v3 >= model-selection-baseline ~ PAL, P2C > CoT
but in aggregate,
RIMS v3 >= model-selection-baseline > PAL, P2C, CoT
significantly improves results in
has no effect in
elsewhere, RIMS v3 somewhat improves the performance over baselines
No single method surpasses the feedback method, while for some types PAL or P2C surpasses the model-selection baseline.
MATH-types | method | in-total perf. (cot;pal;p2c) (method+majority_vote) | % | conflict (method) | % | non-conflict (majority_vote) | % |
---|---|---|---|---|---|---|---|
Geometry | baseline | 102 (91;79;81) / 479 | 21.3% (19.0;16.5;16.9 %) | 28 / 275 | 10.2% | 74 / 544 | 13.6% |
rims | 102 / 479 | 21.3% | 28 / 275 | 10.2%* | |||
Number Theory | baseline | 279 (127;303;294) / 540 | 51.7% (23.5;56.1;54.4 %) | 10 / 208 | 4.8% | 269 / 544 | 49.4% |
rims | 314 / 540 | 58.1% | 45 / 208 | 21.6% | |||
Prealgebra | baseline | 505 (369;473;463) / 871 | 58.0% (42.4;54.3;53.2 %) | 35 / 247 | 14.2% | 470 / 544 | 86.4% |
rims | 529 / 871 | 60.7% | 59 / 247 | 23.9% | |||
Algebra | baseline | 585 (423;508;551) / 1187 | 49.3% (35.6;42.7;46.4 %) | 78 / 481 | 16.2% | 507 / 544 | 93.2% |
rims | 606 / 1187 | 51.1% | 99 / 481 | 20.6% | |||
Counting & Probability | baseline | 169 (106;184;163) / 474 | 35.7% (22.4;38.8;34.4 %) | 13 / 223 | 5.8% | 156 / 544 | 28.7% |
rims | 182 / 474 | 38.4% | 26 / 223 | 11.7% | |||
Intermediate Algebra | baseline | 137 (81;116;129) / 901 | 15.2% (9.0;12.8;14.3 %) | 27 / 517 | 5.2% | 110 / 544 | 20.2% |
rims | 143 / 901 | 15.9% | 33 / 516 | 6.4%* | |||
Precalculus | baseline | 54 (35;41;45) / 544 | 9.9% (6.4;7.5;8.3 %) | 11 / 309 | 3.6% | 43 / 544 | 7.9% |
rims | 56 / 544 | 10.3% | 13 / 309 | 4.2%* |
/math_full_0613long/ | accuracy |
---|---|
baseline | 1831/4996 (36.65%) |
rims | 1932/4996 (38.67%) |
rims_abl | 1769/4996 (35.41%) |
cot | 1232/4996 (24.7%) |
pal | 1704/4996 (34.1%) |
p2c | 1726/4996 (34.5%) |
Level | method | Conflict (method) | Non-Conflict (majority_vote) | In-Total (method+majority_vote) | COT | PAL | P2C |
---|---|---|---|---|---|---|---|
1 | Baseline | 14/90 (15.6%) | 281/1324 (21.2%) | 295/437 (67.5%) | 229/437 (52.4%) | 269/437 (61.6%) | 285/437 (65.2%) |
1 | Rims | 20/90 (22.2%) | - | 301/437 (68.9%) | - | - | - |
2 | Baseline | 38/292 (13.0%) | 439/1324 (33.2%) | 477/893 (53.4%) | 344/893 (38.5%) | 436/893 (48.8%) | 456/893 (51.1%) |
2 | Rims | 59/292 (20.2%) | - | 498/893 (55.8%) | - | - | - |
3 | Baseline | 42/483 (8.7%) | 425/1324 (32.1%) | 467/1128 (41.4%) | 300/1128 (26.6%) | 454/1128 (40.2%) | 438/1128 (38.8%) |
3 | Rims | 87/483 (18.0%) | - | 512/1128 (45.4%) | - | - | - |
4 | Baseline | 54/591 (9.1%) | 321/1324 (24.2%) | 375/1214 (30.9%) | 235/1214 (19.4%) | 339/1214 (27.9%) | 344/1214 (28.3%) |
4 | Rims | 76/591 (12.9%) | - | 397/1214 (32.7%) | - | - | - |
5 | Baseline | 54/804 (6.7%) | 163/1324 (12.3%) | 217/1324 (16.4%) | 124/1324 (9.4%) | 206/1324 (15.6%) | 203/1324 (15.3%) |
5 | Rims | 61/803 (7.6%) | - | 224/1324 (16.9%) | - | - | - |
This does not seem to be explained by the composition of types per level.
 | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
---|---|---|---|---|---|
Algebra | 135 (30.89%) | 201 (22.51%) | 261 (23.14%) | 283 (23.31%) | 307 (23.19%) |
Counting & Probability | 39 (8.92%) | 101 (11.31%) | 100 (8.87%) | 111 (9.14%) | 123 (9.29%) |
Geometry | 38 (8.7%) | 82 (9.18%) | 102 (9.04%) | 125 (10.3%) | 132 (9.97%) |
Intermediate Algebra | 52 (11.9%) | 128 (14.33%) | 193 (17.11%) | 248 (20.43%) | 280 (21.15%) |
Number Theory | 30 (6.86%) | 92 (10.3%) | 122 (10.82%) | 142 (11.7%) | 154 (11.63%) |
Prealgebra | 86 (19.68%) | 177 (19.82%) | 224 (19.86%) | 191 (15.73%) | 193 (14.58%) |
Precalculus | 57 (13.04%) | 112 (12.54%) | 126 (11.17%) | 114 (9.39%) | 135 (10.2%) |
I need to rethink the interpretation of how much the methods overlap; I will share again after revising.
For each MATH problem type, I counted how many problems each method solves and what fraction of those only that method can solve. Conclusion: the higher the accuracy on a type, the lower the uniqueness.
# Interpretation & plan
So the current cot, pal, p2c (with gsm examples) are not completely equivalent to each other,
and if we treat the roughly 10% accuracy gaps between methods as each method's advantage, we can count how often each method gets selected.
If we want to make the differences between methods stand out more, the following actionable options exist.
MATH = 4996 rows
 | selection effect | feedback_effect | in total |
---|---|---|---|
model_selection | 202 (4.0 %p) | 0 | 202 |
rims | 95 (1.9 %p) | 208 (4.2 %p) | 303 (6.1 %p) |
upperbound | 722 | - | 722 (14.5 %p) |
The OCW evaluation looks wrong.
`is_equiv_ocw` cannot parse and check the answers against the provided prompt's answers. Why? Flawed parsing and equivalence logic (surprisingly, from the original authors' code).
OCW results re-measured with `is_equiv_ocw` modified to `normalize_symbolic_exp` + `is_equiv_exp` (previously `normalize_final_answer` + `is_equiv_tex`):
cot: 27 / 272 (9.9%)
pal: 38 / 272 (14.0%)
p2c: 36 / 272 (13.2%)
Mostly the same results... this alone does not explain anything about my modification to the original eval code. I should check each configuration separately.
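For orientation, below is a minimal sketch of the kind of symbolic equivalence check `is_equiv_exp` is expected to perform on already-parsed sympy expressions; the function name `exprs_equivalent` is illustrative, not the repo's actual implementation.

```python
# Minimal sketch (assumption): two symbolic answers count as equal when their
# difference simplifies to zero; fall back to a structural comparison if
# simplification fails.
import sympy as sp

def exprs_equivalent(pred: sp.Expr, gold: sp.Expr) -> bool:
    try:
        return sp.simplify(pred - gold) == 0
    except Exception:
        return sp.srepr(pred) == sp.srepr(gold)

x = sp.Symbol("x")
print(exprs_equivalent(sp.sin(x) ** 2 + sp.cos(x) ** 2, sp.Integer(1)))  # True
```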
Yesterday I checked the Azure endpoint with Taehyung, and I also tested the OCW scoring function after fixing it.
// eval_new correct (3)
{"answer": "x_{0} \\cos (\\omega t)+$ $\\dot{x}_{0} \\sin (\\omega t) / \\omega", "artificial_wrong": "1+x_{0} \\cos (\\omega t)+$ $\\dot{x}_{0} \\sin (\\omega t) / \\omega", "eval": true, "eval_new": false}
{"answer": "\\frac{1}{b-a}\\left(e^{-a t}-e^{-b t}\\right)", "artificial_wrong": "1+\\frac{1}{b-a}\\left(e^{-a t}-e^{-b t}\\right)", "eval": "EVAL_FAIL! cannot determine truth value of Relational", "eval_new": false}
{"answer": "m_{p} c^{2}\\left(\\gamma^{2}-1\\right) \\sin ^{2} \\theta", "artificial_wrong": "1+m_{p} c^{2}\\left(\\gamma^{2}-1\\right) \\sin ^{2} \\theta", "eval": "EVAL_FAIL! cannot determine truth value of Relational", "eval_new": false}
// both evaluators wrong (42)
{"answer": "4.5e33", "artificial_wrong": "1+4.5e33", "eval": true, "eval_new": true}
{"answer": "3.83e35", "artificial_wrong": "1+3.83e35", "eval": true, "eval_new": true}
{"answer": "8.7e8", "artificial_wrong": "1+8.7e8", "eval": true, "eval_new": true}
{"answer": "4e33", "artificial_wrong": "1+4e33", "eval": true, "eval_new": true}
{"answer": "3.3e12", "artificial_wrong": "1+3.3e12", "eval": true, "eval_new": true}
{"answer": "3e6", "artificial_wrong": "1+3e6", "eval": true, "eval_new": true}
{"answer": "7e37", "artificial_wrong": "1+7e37", "eval": true, "eval_new": true}
{"answer": "7.5e7", "artificial_wrong": "1+7.5e7", "eval": true, "eval_new": true}
{"answer": "2e27", "artificial_wrong": "1+2e27", "eval": true, "eval_new": true}
{"answer": "2.75e11", "artificial_wrong": "1+2.75e11", "eval": true, "eval_new": true}
{"answer": "6e13", "artificial_wrong": "1+6e13", "eval": true, "eval_new": true}
{"answer": "4.4e7", "artificial_wrong": "1+4.4e7", "eval": true, "eval_new": true}
{"answer": "3e8", "artificial_wrong": "1+3e8", "eval": true, "eval_new": true}
{"answer": "1e11", "artificial_wrong": "1+1e11", "eval": true, "eval_new": true}
{"answer": "400000", "artificial_wrong": "1+400000", "eval": true, "eval_new": true}
{"answer": "5.47e5", "artificial_wrong": "1+5.47e5", "eval": true, "eval_new": true}
{"answer": "2.19e6", "artificial_wrong": "1+2.19e6", "eval": true, "eval_new": true}
{"answer": "1.87e6", "artificial_wrong": "1+1.87e6", "eval": true, "eval_new": true}
{"answer": "4.45e15", "artificial_wrong": "1+4.45e15", "eval": true, "eval_new": true}
{"answer": "9e11", "artificial_wrong": "1+9e11", "eval": true, "eval_new": true}
{"answer": "7.353e14", "artificial_wrong": "1+7.353e14", "eval": true, "eval_new": true}
{"answer": "1.39e9", "artificial_wrong": "1+1.39e9", "eval": true, "eval_new": true}
{"answer": "9.35e5", "artificial_wrong": "1+9.35e5", "eval": true, "eval_new": true}
{"answer": "2.88e16", "artificial_wrong": "1+2.88e16", "eval": true, "eval_new": true}
{"answer": "7.26e6", "artificial_wrong": "1+7.26e6", "eval": true, "eval_new": true}
{"answer": "1.85e5", "artificial_wrong": "1+1.85e5", "eval": true, "eval_new": true}
{"answer": "4.46e19", "artificial_wrong": "1+4.46e19", "eval": true, "eval_new": true}
{"answer": "3.21e13", "artificial_wrong": "1+3.21e13", "eval": true, "eval_new": true}
{"answer": "2.45e6", "artificial_wrong": "1+2.45e6", "eval": true, "eval_new": true}
{"answer": "7.02e5", "artificial_wrong": "1+7.02e5", "eval": true, "eval_new": true}
{"answer": "3.75e9", "artificial_wrong": "1+3.75e9", "eval": true, "eval_new": true}
{"answer": "1.07e16", "artificial_wrong": "1+1.07e16", "eval": true, "eval_new": true}
others the same (272 - 42 - 3 = 227)
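A minimal sketch of the sanity check apparently behind the records above, assuming the procedure is simply prepending "1+" to each gold answer; `evaluator` stands in for either version of `is_equiv_ocw`.

```python
# Minimal sketch (assumption): make each gold answer deliberately wrong by
# prepending "1+", then check whether the evaluator still accepts it.
# A True in the "eval" field is a false accept by the grader.
def artificial_wrong_records(gold_answers, evaluator):
    records = []
    for ans in gold_answers:
        wrong = "1+" + ans
        records.append({
            "answer": ans,
            "artificial_wrong": wrong,
            "eval": evaluator(wrong, ans),  # should be False for a sound grader
        })
    return records
```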
The OCW parsing / evaluation issue was resolved with the following merge: https://github.com/fgenie/rims_minimal/pull/40#issue-2186864593
After implementing `utils.llm_query_utils.extract_ans_from_cot_MATHnOCW` and re-scoring, the following changes were observed on the CoT solutions obtained with the GSM fewshot. Experiments will use this parsing function from now on. After touching up `utils.math_util.is_equiv` a bit more (reinforced by adding a string exact match at the very front), the grader became more trustworthy. (update)

Metric | Math | OCW |
---|---|---|
old_acc | 0.247 | 0.099 |
new_acc | 0.274 (+2.7%p) | 0.195 (+9.6%p) |
delta correct | +138 / 4996 | +26 / 272 |
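A minimal sketch of the "string exact match at the very front" reinforcement described above, assuming it simply wraps the existing logic; `core_is_equiv` stands in for the current `utils.math_util.is_equiv` TeX/symbolic comparison.

```python
# Minimal sketch (assumption): try a cheap normalized string comparison first;
# only fall back to the heavier symbolic/TeX equivalence check on a miss.
def is_equiv(pred: str, gold: str, core_is_equiv) -> bool:
    normalize = lambda s: s.strip().replace(" ", "")
    if normalize(pred) == normalize(gold):
        return True                      # exact-match shortcut
    return core_is_equiv(pred, gold)     # symbolic/TeX equivalence fallback
```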
Confirmed that the `sp.latex(code_return)` handling does not affect arithmetic results such as GSM, while doing the processing needed for OCW and MATH:
- GSM: no net change / 6 None's
- OCW: +24 net change / 82 rows changed out of 272 / 58 None's
- MATH: +147 net change / 1318 rows changed out of 4996 / 1172 None's
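A minimal sketch of that normalization, assuming the intent is just to render sympy objects returned by PAL/P2C code into TeX strings before comparison; `normalize_code_return` is a hypothetical helper, not the repo's function.

```python
# Minimal sketch (assumption): numeric code returns (GSM/arithmetic) pass
# through unchanged; sympy objects (common for OCW/MATH) are rendered to TeX
# so they can be compared against TeX-formatted gold answers.
import sympy as sp

def normalize_code_return(code_return):
    if code_return is None:
        return None
    if isinstance(code_return, (int, float)):
        return str(code_return)
    if isinstance(code_return, sp.Basic):
        return sp.latex(code_return)
    return str(code_return)

print(normalize_code_return(42))              # "42"
print(normalize_code_return(sp.sqrt(2) / 2))  # "\frac{\sqrt{2}}{2}"
```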
Progress is tracked here... only a few checkboxes remain on the working branch's todolist.
The previous main results below (using gsm prompt v3), re-scored with the updated evaluation code: https://github.com/fgenie/rims_minimal/issues/35#issue-2123202439
eval fix effect only | model_selection | rims |
---|---|---|
gsm | 1087/1319 (-) | 1114/1319 (-) |
ocw | 36/272 (-) | 39/272 (-) |
math | 1832/4996 (+1) | 1936/4996 (+4) |
OVERLAPS + Individual performance (chatgpt0613long)
prompt = GSM_OLD_BEST (only gsm fewshots)
dataset = gsm (total 1319)
{'all': 829,
'cot_only': 62,
'p2c_only': 32,
'pal_only': 67,
'cotpal-p2c': 85,
'p2ccot-pal': 54,
'palp2c-cot': 73}
single perf =
'cot': 1030,
'p2c': 988,
'pal': 1054,
dataset = math (total 5000)
{'all': 657,
'cot_only': 283,
'p2c_only': 322,
'pal_only': 317,
'p2ccot-pal': 158,
'cotpal-p2c': 134,
'palp2c-cot': 610}
'cot': 1232,
'pal': 1718,
'p2c': 1747,
dataset = ocw (total 272)
{'all': 6,
'cot_only': 13,
'pal_only': 14,
'p2c_only': 10,
'cotpal-p2c': 3,
'p2ccot-pal': 5,
'palp2c-cot': 15}
'cot': 27,
'pal': 38,
'p2c': 36,
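A minimal sketch of how the overlap categories above are presumed to be counted, given per-method sets of correctly solved question ids (the function name is illustrative).

```python
# Minimal sketch (assumption): Venn-style overlap counts over per-method
# sets of correctly answered question ids.
def overlap_counts(cot: set, pal: set, p2c: set) -> dict:
    return {
        "all": len(cot & pal & p2c),
        "cot_only": len(cot - pal - p2c),
        "pal_only": len(pal - cot - p2c),
        "p2c_only": len(p2c - cot - pal),
        "cotpal-p2c": len((cot & pal) - p2c),
        "p2ccot-pal": len((p2c & cot) - pal),
        "palp2c-cot": len((pal & p2c) - cot),
    }
```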
https://llm4a.slack.com/archives/C05FKA9C85P/p1711298901583759
Preview:
The MATH results I said I would bring over are below. I will post the remaining results as they are consolidated, organized from the following perspectives:
- What changed compared to the previous results (you can preview this in the math-baseline results below)
- How differently each method (cot, pal, p2c) behaves in terms of the final answer
- Whether rims > baseline
- Whether rims > ablations (-hint / -hint-mistakes / -hint-mistakes-1st attempt)

I found and fixed a small bug in gsm CoT; gsm CoT now produces proper scores.
Scoring TeX-formatted answers still has a very low success rate. That was already true of the original minerva code, and even after the fixes it is only slightly better.
Additionally, I confirmed that rims prompting does not need a long context beyond 4k.
Last time, max_token was set excessively high during the experiment, which forced us onto the 16k model, but input+output actually fits within a 4k context.
However, for comparability with the previous results, this experiment uses the same gpt-3.5-turbo-0613-16k. Future experiments do not need to.
math baseline result (model = chatgpt0613long, temperature = 0)
Individual performance: cot 1527 / 4999 (30.5%) (former: 24.7%), pal 2047 / 4999 (40.9%) (former: 34.1%), p2c 1758 / 4999 (35.2%) (former: 34.5%)
Overall performance (model-selection-reasoning): overall_acc 2120 / 4999 (42.4%), success_rate 354 / 2438 (14.5%); of the 4999 total, 2438 went to selection.
"former" = the last results run with the GSM fewshot. For comparability, the following were kept fixed: model = chatgpt 0613 long, temperature. cot/pal/p2c performance and the model-selection baseline each went up by about 5%p. The contributing factors:
- cot, pal fewshot: gsm-8shots --> minerva-math-4shots
- p2c fewshot: gsm-8shots --> MBPP 8 shots (= the original plan2code paper prompt). The prompt format also changed, so the planning step is sometimes not explicitly visible in the LLM-generated answer, but this matches what the original p2c paper does.
- the MATH parsing function (affects the cot results)
- the MATH evaluation function (affects the final scores and the majority-voting step)
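A minimal sketch of how overall_acc and success_rate above are presumed to be computed (success_rate counting only questions routed to selection, e.g. 354 of the 2438 conflict cases here); the record layout is an assumption.

```python
# Minimal sketch (assumption): overall accuracy over all questions vs.
# success rate over the conflict subset that went to selection.
def summarize(results):
    # results: list of dicts with boolean fields "went_to_selection" and "correct"
    selected = [r for r in results if r["went_to_selection"]]
    overall_acc = sum(r["correct"] for r in results) / len(results)
    success_rate = sum(r["correct"] for r in selected) / len(selected)
    return overall_acc, success_rate
```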
gsm (+4.5%p, +59/1319) and math (+2.3%p, +115/5000) satisfy rims > simple greedy, but ocw (-1.4%p, -4/272) comes out as rims ~< simple greedy. If we try a few more prompts, one of them should probably turn out fine; the OCW sample is small.
*Of the two math prompts, only one satisfies the above; for gsm I only tried one and it worked; for ocw I have tried only two so far.
chatgpt 1106
Distinct | GSM | OCW | Math |
---|---|---|---|
cot_only | 57 (4.77%) | 31 (39.74%) | 536 (17.81%) |
pal_only | 53 (4.44%) | 5 (6.41%) | 445 (14.79%) |
p2c_only | 44 (3.69%) | 19 (24.36%) | 299 (9.94%) |
prompt | Overall Accuracy | Success Rate |
---|---|---|
simple greedy | 44/272 (16.2%) | 11/187 (5.9%) |
simple greedy + SC@5 (cotT0.5, palT0.8) | 57 / 272 | 44 / 222 |
----------------------------------------------------------------- | ------------------ | -------------- |
rims_gsm_old | 55 / 272 (20.2%) | 14 / 187 (7.5%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims (p2c-cot.pal-p2c.cot-p2c) | 63 / 272 (23.2%) | 22 / 155 (14.2%) |
SC@5 T=0.7 | 60 / 272 | 47 / 222 (fail=1) |
SC@5 T=0.5 | 73 / 264 | 60 / 222 |
SC@5 T=0.2 | 66 / 264 | 53 / 222 |
SC@10 T=0.5 | 82 / 249 | 75 / 227 |
SC@10 T=0.2 | 85 / 249 | 78 / 227 |
-hint | 61 / 272 (22.4%) | 20 / 155 (12.9%) |
-hint-mistakes | 60 / 272 (22.1%) | 19 / 155 (12.3%) |
-hint-mistakes-attempt1 | 54 / 272 (19.9%) | 13 / 155 (8.4%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims' (p2c-cot.pal-p2c.pal-cot) | 57 / 272 (21.0%) | 16 / 187 (8.6%) |
SC@5 T=0.7 | 56 / 272 | 43 / 222 (19.4%) |
SC@5 T=0.5 | 63 / 264 | 50 / 222 |
SC@5 T=0.2 | 65 / 264 | 52 / 222 |
SC@10 T=0.5 | 79 / 249 | 72 / 227 |
SC@10 T=0.2 | 77 / 249 | 70 / 227 (1 fail) |
-hint | 52 / 272 (19.1%) | 11 / 187 (5.9%) |
-hint-mistakes | 56 / 272 (20.6%) | 15 / 187 (8.0%) |
-hint-mistakes-attempt1 | 51 / 272 (18.8%) | 10 / 187 (5.3%) |
----------------------------------------------------------------- | ------------------ | -------------- |
cot | 49 / 272 (18.0%) | - |
pal | 17 / 272 (6.2%) | - |
p2c | 37 / 272 (13.6%) | - |
prompt | Overall Accuracy | Success Rate |
---|---|---|
simple greedy | 1081 / 1319 (82.0%) | 43 / 196 (21.9%) |
SC@15 | 1126 / 1297 ( 22 api errors ) | 261 / 413 (63.2%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims_gsm_old | 1127 / 1319 (85.4%) | 86 / 193 (44.6%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims | 1122 / 1319 (85.1%) | 81 / 193 (42.0%) |
-hint | 1131 / 1319 (85.7%) | 90 / 193 (46.6%) |
-hint-mistakes | 1122 / 1319 (85.1%) | 81 / 193 (42.0%) |
-hint-mistakes-attempt1 | 1103 / 1319 (83.6%) | 62 / 193 (32.1%) |
+p2c_rewrote (GSM_RIMS) | 1127 / 1319 (85.4%) | 86 / 193 (44.6%) |
SC@15 T=0.2 | 1151 / 1288 (+9 fails) | 286 / 404 (70.8%) |
SC@15 T=0.5 | 1153 / 1285 (+12 fails) | 288 / 401 (71.8%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims'+p2c_rewrote (cot2p2c.pal2cot.pal2p2c) (GSM_RIMS1) | 1103 / 1319 (83.6%) | 62 / 193 (32.1%) |
SC@15 T=0.2 | 1150 / 1296 (+1 fails) | 285 / 412 (69.2%) |
SC@15 T=0.5 | 1155 / 1285 (+12 fails) | 290 / 401 (72.3%) |
rims''+p2c_rewrote (pal2p2c.cot2p2c.cot2pal) (GSM_RIMS2) | 1113 / 1319 (84.4%) | 72 / 193 (37.3%) |
SC@15 T=0.2 | 1143 / 1292 (+5 fails) | 278 / 408 (68.1%) |
SC@15 T=0.5 | 1143 / 1292 | 278 / 408 (68.1%) |
----------------------------------------------------------------- | ------------------ | -------------- |
cot | 921 / 1319 (69.8%) | - |
pal | 1038 / 1319 (78.7%) | - |
p2c | 991 / 1319 (75.1%) | - |
prompt | Overall Accuracy | Success Rate |
---|---|---|
simple greedy | 2086 / 4999 (41.7%) | 361 / 2550 (14.2%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims_gsm_old | 2192 / 4999 (43.8%) | 392 / 2361 (16.6%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims (p2c-cot.pal-p2c.pal-cot) | 2188 / 4999 (43.8%) | 388 / 2361 (16.4%) |
-hint | 2218 / 4999 (44.4%) | 418 / 2361 (17.7%) |
-hint-mistakes | 2170 / 4999 (43.4%) | 416 / 2503 (16.6%) |
-hint-mistakes-attempt1 | 2151 / 4999 (43.0%) | 351 / 2361 (14.9%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims (1) | 2191 / 4999 (43.8%) | 391 / 2361 (16.6%) |
-hint | 2166 / 4999 (43.3%) | 366 / 2361 (15.5%) |
-hint-mistakes | 2177 / 4999 (43.5%) | 377 / 2361 (16.0%) |
-hint-mistakes-attempt1 | 2137 / 4999 (42.7%) | 382 / 2500 (15.3%) |
----------------------------------------------------------------- | ------------------ | -------------- |
cot | 1644 / 4999 (32.9%) | |
pal | 1900 / 4999 (38.0%) | |
p2c | 1796 / 4999 (35.9%) |
 | cot_only | pal_only | p2c_only |
---|---|---|---|
math (5000) | 557 | 104 | 977 |
ocw (272) | 28 | 5 | 42 |
gsm (1319) | 15 | 13 | 10 |
satisfactory result
prompt | Overall Accuracy | Success Rate (selection max: 38/41 (92.7%)) |
---|---|---|
simple greedy | 1249 / 1319 (94.7%) | 13 / 41 (31.7%) |
rims_gsm_old | 1262 / 1319 (95.7%) | 23 / 31 (56.1%) |
rims_gsm_newer (remove p2c plan from above) | 1259 / 1319 (95.5%) | 20 / 41 (48.8%) |
rims* (p2c2cot.pal2p2c.pal2cot) | 1260 / 1319 (95.5%) | 21 / 41 (51.2%) |
rims* (pal2p2c.cot2p2c.cot2pal) | 1256 / 1319 (95.2%) | 17 / 41 (41.5%) |
rims* (cot2p2c.pal2cot.pal2p2c) | 1259 / 1319 (95.5%) | 20 / 41 (48.8%) |
cot | 1110 / 1319 (84.2%) | |
pal | 1239 / 1319 (93.9%) | |
p2c | 1226 / 1319 (92.9%) |
*Those are for unifying the reformatted p2c format of MATH and ocw_courses.
satisfactory result
prompt | Overall Accuracy | Success Rate (selection max: 1638/4999 (64.5%)) |
---|---|---|
simple greedy | 2126 / 4999 (42.5%) | 401 / 2539 (15.8%) |
rims_gsm_old | 2539 / 4999 (50.8%) | 814 / 2539 (32.1%) |
rims (p2c-cot.pal-p2c.pal-cot) | 2584 / 4999 (51.7%) | 859 / 2539 (33.8%) |
rims (p2c-cot.pal-p2c.pal-cot) (1) | 2597 / 4999 (52.0%) | 872 / 2539 (34.3%) |
cot | 1828 / 4999 (36.6%) | |
pal | 741 / 4999 (14.8%) | |
p2c | 2468 / 4999 (49.4%) |
*(1) has a different question in its fewshot blurb
unsatisfying...🧐
prompt | Overall Accuracy | Success Rate (selection max: 85/157 (54.1%)) |
---|---|---|
simple greedy | 69 / 272 (25.4%) | 16 / 157 (10.2%) |
rims_gsm_old | 79 / 272 (29.0%) | 26 / 157 (16.6%) |
rims (p2c-cot.pal-p2c.pal-cot) | 74 / 272 (27.2%) | 21 / 157 (13.4%) |
rims (p2c-cot.pal-p2c.cot-p2c) | 67 / 272 (24.6%) | 14 / 157 (8.9%) |
cot | 61 / 272 (22.4%) | |
pal | 23 / 272 (8.5%) | |
p2c | 78 / 272 (28.7%) |
The following applies to all prompts in the gsm exp: `rims` and `simple-greedy` prompts (the last one is what is finally used).
For MATH and OCW, p2c plans sometimes appear implicitly and sometimes explicitly (even though the prompts that generated them were all explicit!):
```python
# p2c in gsm_old
{NUMBERED LIST}  # plan
{CODE}           # code

# p2c in gsm_newer (that is, "remove plan" above in the gpt4 table)
def solution():
    """ docstring usually dropped the plan given in the prompt """
    {CODE}  # but the code includes kind-of-numbered comments

# p2c in rims* above in the gpt4 table
def solution():
    """
    question and some explanations
    {NUMBERED_LIST_PLAN}
    """
    {CODE}
```
For more, see the prompts below:
The tables below break down the `rims`-correct cases, that is, cases where the individual methods conflict: `selection_effect` if rims answered with the same method that originally answered correctly, `reflection_effect` otherwise. For example, if `cot` was originally correct but rims answers with `pal` and is evaluated correct, it counts as correct by `reflection_effect`; if `pal` was originally correct and rims also answers with `pal` and is evaluated correct, it counts as `selection_effect`. `selection_effect` further breaks down into `select_cot|pal|p2c`.

 | reflection_effect | selection_effect | select_p2c | select_pal | select_cot |
---|---|---|---|---|---|
simple greedy | 0.0 % | 100.0 % | 29.4 % | 8.0 % | 62.6 % |
rims_gsm_old (p2c2cot.pal2p2c.pal2cot) | 90.5 % | 9.5 % | 0.6 % | 0.1 % | 8.7 % |
rims* (p2c-cot.pal-p2c.pal-cot) (1) | 85.1 % | 14.9 % | 5.2 % | 0.2 % | 9.5 % |
rims* (p2c-cot.pal-p2c.pal-cot) | 85.6 % | 14.4 % | 5.6 % | 0.1 % | 8.7 % |
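A minimal sketch of that categorization, assuming the split only depends on whether the method rims finally answered with was already correct on its own; `classify_rims_correct` and `originally_correct` are illustrative names.

```python
# Minimal sketch (assumption): classify a rims-correct conflict case.
# `originally_correct` maps each method name to whether its standalone
# answer was correct before rims ran.
def classify_rims_correct(rims_method: str, originally_correct: dict) -> str:
    if originally_correct.get(rims_method, False):
        # rims picked a method that was already correct on its own
        return f"selection_effect (select_{rims_method})"
    # the correct answer came out of the reflection/rewriting step instead
    return "reflection_effect"

# e.g. cot was originally correct, rims answered correctly via pal:
print(classify_rims_correct("pal", {"cot": True, "pal": False, "p2c": False}))
# -> reflection_effect
```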
 | reflection_effect | selection_effect | select_p2c | select_pal | select_cot |
---|---|---|---|---|---|
simple greedy | 0.0% | 100.0% | 7.7% | 38.5% | 53.8% |
rims_gsm_old (p2c2cot.pal2p2c.pal2cot) | 60.9% | 39.1% | 0.0% | 0.0% | 39.1% |
rims* (p2c2cot.pal2p2c.pal2cot) | 66.7% | 33.3% | 0.0% | 0.0% | 33.3% |
rims* (pal2p2c.cot2p2c.cot2pal) | 64.7% | 35.3% | 0.0% | 0.0% | 35.3% |
rims* (cot2p2c.pal2cot.pal2p2c) | 55.0% | 45.0% | 10.0% | 0.0% | 35.0% |
 | reflection_effect | selection_effect | select_p2c | select_pal | select_cot |
---|---|---|---|---|---|
simple greedy | 0.0% | 100.0% | 18.8% | 0.0% | 81.2% |
rims_gsm_old (p2c2cot.pal2p2c.pal2cot) | 96.2% | 3.8% | 0.0% | 0.0% | 3.8% |
rims* (p2c2cot.pal2p2c.pal2cot) | 90.5% | 9.5% | 9.5% | 0.0% | 0.0% |
rims* (p2c2cot.pal2p2c.cot2p2c) | 71.4% | 28.6% | 28.6% | 0.0% | 0.0% |
 | reflection_effect | selection_effect | select_p2c | select_pal | select_cot |
---|---|---|---|---|---|
simple greedy | 0.0 % | 100.0 % | 3.0 % | 13.3 % | 83.7 % |
rims_gsm_old | 67.1 % | 32.9 % | 5.6 % | 17.6 % | 9.7 % |
rims* p2c-cot.pal-p2c.pal-cot | 75.5 % | 24.5 % | 7.0 % | 2.8 % | 14.7 % |
rims* p2c-cot.pal-p2c.pal-cot (1) | 75.2 % | 24.8 % | 7.9 % | 5.4 % | 11.5 % |
rims* p2c-cot.pal-p2c.pal-cot-hint | 75.6 % | 24.4 % | 6.9 % | 2.4 % | 15.1 % |
rims* p2c-cot.pal-p2c.pal-cot-hint (1) | 77.6 % | 22.4 % | 8.5 % | 0.8 % | 13.1 % |
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes | 27.2 % | 72.8 % | 5.5 % | 9.1 % | 58.2 % |
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes (1) | 70.6 % | 29.4 % | 7.2 % | 5.8 % | 16.4 % |
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes-attempt1 | 70.7 % | 29.3 % | 6.6 % | 14.0 % | 8.8 % |
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes-attempt1 (1) | 30.1 % | 69.9 % | 5.0 % | 12.3 % | 52.6 % |
 | reflection_effect | selection_effect | select_p2c | select_pal | select_cot |
---|---|---|---|---|---|
simple greedy | 0.0 % | 100.0 % | 2.3 % | 2.3 % | 95.3 % |
rims_gsm_old | 79.1 % | 20.9 % | 3.5 % | 1.2 % | 16.3 % |
rims_gsm_newer | 69.1 % | 30.9 % | 1.2 % | 2.5 % | 27.2 % |
rims_gsm_newer-hint | 73.3 % | 26.7 % | 0.0 % | 0.0 % | 26.7 % |
rims_gsm_newer-hint-mistakes | 74.1 % | 25.9 % | 0.0 % | 2.5 % | 23.5 % |
rims_gsm_newer-hint-mistakes-attempt1 | 62.9 % | 37.1 % | 1.6 % | 32.3 % | 3.2 % |
rims* p2c2cot.pal2p2c.pal2cot | 69.8 % | 30.2 % | 9.3 % | 2.3 % | 18.6 % |
rims* pal2p2c.cot2p2c.cot2pal | 82.3 % | 17.7 % | 3.2 % | 12.9 % | 1.6 % |
rims* cot2p2c.pal2cot.pal2p2c | 70.8 % | 29.2 % | 8.3 % | 2.8 % | 18.1 % |
 | reflection_effect | selection_effect | select_p2c | select_pal | select_cot |
---|---|---|---|---|---|
simple greedy | 0.0 % | 100.0 % | 18.2 % | 0.0 % | 81.8 % |
rims_gsm_old | 75.0 % | 25.0 % | 12.5 % | 6.2 % | 6.2 % |
rims* p2c-cot.pal-p2c.cot-p2c | 59.1 % | 40.9 % | 31.8 % | 0.0 % | 9.1 % |
rims* p2c-cot.pal-p2c.pal-cot | 73.3 % | 26.7 % | 6.7 % | 0.0 % | 20.0 % |
rims* p2c-cot.pal-p2c.cot-p2c-hint | 80.0 % | 20.0 % | 5.0 % | 0.0 % | 15.0 % |
rims* p2c-cot.pal-p2c.pal-cot-hint | 80.0 % | 20.0 % | 0.0 % | 0.0 % | 20.0 % |
rims* p2c-cot.pal-p2c.cot-p2c-hint-mistakes | 73.7 % | 26.3 % | 15.8 % | 0.0 % | 10.5 % |
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes | 86.7 % | 13.3 % | 6.7 % | 0.0 % | 6.7 % |
rims* p2c-cot.pal-p2c.cot-p2c-hint-mistakes-attempt1 | 100.0 % | 0.0 % | 0.0 % | 0.0 % | 0.0 % |
rims* p2c-cot.pal-p2c.pal-cot-hint-mistakes-attempt1 | 70.0 % | 30.0 % | 20.0 % | 0.0 % | 10.0 % |
(Seungjae) It would be good to carefully work out the cost side of rims. (chul) We should apply self-consistency on top of (2), and compare performance against simple-greedy + SC on (1) (chatgpt). (chul) Run the opensource LLM experiments here if at all possible? (Kichang has no spare bandwidth.)
(will update further after math done on Tue)
gpt-3.5-turbo-1106
- `rims` and `simple-greedy` (baseline)
- SC helps `simple-greedy` (baseline)
- @ SC5, `rims` > `simple-greedy`
gpt-3.5-turbo-1106
note that rims (blabla) just denotes an example combination + ordering.
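For reference, a minimal sketch of what the SC@k rows in these tables are presumed to mean; `sample_answer` is a hypothetical sampling callback, and the tie-breaking detail is an assumption.

```python
# Minimal sketch (assumption): self-consistency = sample k answers at
# temperature T and keep the most common one. Denominators smaller than the
# full split size reflect the API failures noted in the tables.
from collections import Counter

def self_consistency(sample_answer, k: int, temperature: float):
    answers = [sample_answer(temperature) for _ in range(k)]
    answers = [a for a in answers if a is not None]  # drop failed generations
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```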
prompt | Overall Accuracy | Success Rate |
---|---|---|
simple greedy | 44/272 (16.2%) | 11/187 (5.9%) |
SC@5 (cotT0.5, palT0.8) | 57 / 272 | 44 / 222 |
SC@10 | 69 / 249 (23 failed) | 62 / 227 |
----------------------------------------------------------------- | ------------------ | -------------- |
rims (p2c-cot.pal-p2c.cot-p2c) | 63 / 272 (23.2%) | 22 / 155 (14.2%) |
SC@5 T=0.2 | 66 / 264 | 53 / 222 |
SC@5 T=0.5 | 73 / 264 | 60 / 222 |
SC@5 T=0.7 | 60 / 272 | 47 / 222 (fail=1) |
SC@10 T=0.2 | 85 / 249 | 78 / 227 |
SC@10 T=0.5 | 82 / 249 | 75 / 227 |
----------------------------------------------------------------- | ------------------ | -------------- |
rims (p2c-cot.pal-p2c.pal-cot) | 57 / 272 (21.0%) | 16 / 187 (8.6%) |
SC@5 T=0.2 | 65 / 264 | 52 / 222 |
SC@5 T=0.5 | 63 / 264 | 50 / 222 |
SC@5 T=0.7 | 56 / 272 | 43 / 222 (19.4%) |
SC@10 T=0.2 | 77 / 249 | 70 / 227 (1 fail) |
SC@10 T=0.5 | 79 / 249 | 72 / 227 |
gpt-3.5-turbo-1106
prompt | Overall Accuracy | Success Rate |
---|---|---|
simple greedy | 1081 / 1319 (82.0%) | 43 / 196 (21.9%) | |
SC@15 | 1126 / 1297 ( 22 api errors ) | 261 / 413 (63.2%) | |
----------------------------------------------------------------- | ------------------ | -------------- | |
rims (newer_best_p2c2cot.pal2p2c.pal2cot) | 1122 / 1319 (85.1%) | 81 / 193 (42.0%) | |
SC@15 T=0.2 | 1151 / 1288 (+9 fails) | 286 / 404 (70.8%) | |
SC@15 T=0.5 | 1153 / 1285 (+12 fails) | 288 / 401 (71.8%) | |
----------------------------------------------------------------- | ------------------ | -------------- | |
rims (cot2p2c.pal2cot.pal2p2c) (GSM_RIMS1) | 1103 / 1319 (83.6%) | 62 / 193 (32.1%) | |
SC@15 T=0.2 | 1150 / 1296 (+1 fails) | 285 / 412 (69.2%) | |
SC@15 T=0.5 | 1155 / 1285 (+12 fails) | 290 / 401 (72.3%) | |
rims (pal2p2c.cot2p2c.cot2pal) (GSM_RIMS2) | 1113 / 1319 (84.4%) | 72 / 193 (37.3%) | |
SC@15 T=0.2 | 1143 / 1292 (+5 fails) | 278 / 408 (68.1%) | |
SC@15 T=0.5 | 1143 / 1292 | 278 / 408 (68.1%) |
Something is going super wrong (SC@5 << T=0, n=1)
prompt | Overall Accuracy | Success Rate |
---|---|---|
simple greedy | 2086 / 4999 (41.7%) | 361 / 2550 (14.2%) |
SC@5, cotT=0.5 / palT=0.8 | 439 / 4904 (9.0%) | 318 / 3684 (8.6%) |
----------------------------------------------------------------- | ------------------ | -------------- |
rims (p2c-cot.pal-p2c.pal-cot) | 2188 / 4999 (43.8%) | 388 / 2361 (16.4%) |
SC@5, T=0.2 | 536 / 4897 (10.9%) | 415 / 3677 (11.3%) |
SC@5, T=0.5 | ||
----------------------------------------------------------------- | ------------------ | -------------- |
rims (1) | 2191 / 4999 (43.8%) | 391 / 2361 (16.6%) |
SC@5, T=0.2 | ||
SC@5, T=0.5 |
@strutive07 @fgenie chul
chul: for openllm, use the two above.
Get a cost estimate for Claude-3.5-sonnet (https://www.computerworld.com/article/2472913/anthropic-claude-3-5-sonnet-is-here-and-its-free.html)
Temperature = 0, n = 1
GSM8K | ||||||
---|---|---|---|---|---|---|
Model | Score file | cot | pal | p2c | simple greedy | rims |
Meta-Llama-3-8B-Instruct | link | 0.7301 | 0.7597 | 0.6513 | 0.817 (1078/1319) | 0.831 (1096/1319) |
Phi-3-small-128k-instruct | link | 0.8438 | 0.8635 | 0.8097 | 0.906 (1195/1319) | 0.920 (1213/1319) |
Math | ||||||
---|---|---|---|---|---|---|
Model | Score file | cot | pal | p2c | simple greedy | rims |
Meta-Llama-3-8B-Instruct | link | 0.3016 | 0.1498 | 0.2134 | 0.319 (1597/5000) | 0.320 (1601/5000) |
Phi-3-small-128k-instruct | link | 0.3684 | 0.3808 | 0.3628 | 0.462 (2308/5000) | 0.414 (2072/5000) |
OCW | ||||||
---|---|---|---|---|---|---|
Model | Score file | cot | pal | p2c | simple greedy | rims |
Meta-Llama-3-8B-Instruct | link | 0.1213 | 0.0257 | 0.0662 | 0.121 (33/272) | 0.110 (30/272) |
Phi-3-small-128k-instruct | link | 0.2684 | 0.1360 | 0.1801 | 0.199 (54/272) | 0.165 (45/272) |
...the floor performance that voting guarantees for the selection algorithm (each method must be correct, and parsing must succeed so the correct answers match each other) differs a lot between datasets. For gsm, voting alone already starts above the best of the three individual methods; the other two do not appear to.
The corresponding result files can be found at 5e0bb5f6a378d39da8d8102485350383ac6bfa60. @strutive07
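A minimal sketch of the voting-guaranteed floor described above, under the assumption that it counts questions where at least two methods are independently correct (so their parsed answers agree and majority vote alone already scores them); `voting_floor` is an illustrative name.

```python
# Minimal sketch (assumption): share of questions where >= 2 of the three
# methods are correct, i.e. what majority vote secures without any selection.
def voting_floor(cot_ok, pal_ok, p2c_ok) -> float:
    # cot_ok / pal_ok / p2c_ok: per-question lists of booleans, same length
    n = len(cot_ok)
    hits = sum(1 for c, p, q in zip(cot_ok, pal_ok, p2c_ok) if c + p + q >= 2)
    return hits / n
```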
Feb 11
Submission deadline
ACL --> NeurIPS after arXiv

OCW low perf:
struggling with symbolics
Click: symbolic v. numeric perf.
chatgpt
- nonconflict_numeric: 30/82
- nonconflict_symbolic: 0/28
- conflict_rims_numeric: 9/108
- conflict_rims_symbolic: 0/52
- conflict_base_numeric: 6/109
- conflict_base_symbol: 0/53

gpt4turbo
- nonconflict_numeric: 52/105
- nonconflict_symbolic: 1/39
- conflict_rims_numeric: 10/76
- conflict_rims_symbolic: 0/41
- conflict_base_numeric: 7/72
- conflict_base_symbol: 0/36
openLLM experiments
LLM:
New data?
(svamp saturates with gpt4)
To Report
must
total performance
selection success rate (for comparing selection methodologies)
reflection-less (ablation)
good to report