fgenie opened 5 months ago
(These experiments were run with the crude pre-November interpreter, so the numbers are less reliable. Still, if we later want to measure how complementary the single-method runs are, we can redo the same arithmetic.)
While preparing the reflexion-kshot-harvesting prompt, I report some numbers we can expect based on the previous ablations.
Our kshot-harvesting agent needs a prompt that invokes a decision among the following methods. For a better solution, the agent should...
To prepare the prompt, I explored ablations (single-method results) on gsm8k, to see realistic cases for method switching.
(Conditions: ChatGPT, greedy sampling, standard few-shot prompts for gsm8k.)
With greedy decoding, how many of the 1319 questions does each method get wrong?
pal: 268, p2c: 381, cot: 279
Do they fail on the same questions? No:
`len(pal_wrongs.intersection(p2c_wrongs))=192`
`len(pal_wrongs.intersection(cot_wrongs))=147`
`len(p2c_wrongs.intersection(cot_wrongs))=181`
What is the lower bound on the error rate if we use the three methods above optimally? 9.02% (=119/1319), since only the questions all three get wrong are unrecoverable:
`len(pal_wrongs.intersection(p2c_wrongs).intersection(cot_wrongs))=119`
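The lower-bound arithmetic above is just a triple set intersection. A minimal sketch with hypothetical wrong-answer index sets (the real sets come from scoring the 1319 gsm8k answers):

```python
# Toy sketch of the complementarity check (hypothetical index sets, not
# the real gsm8k results): each *_wrongs holds the indices of questions
# a method answered incorrectly.
pal_wrongs = {1, 2, 3, 5}
p2c_wrongs = {2, 3, 4, 6}
cot_wrongs = {3, 5, 6, 7}
n_questions = 10

# Pairwise overlaps: the methods fail on partly different questions.
print(len(pal_wrongs & p2c_wrongs))        # 2

# Only questions missed by *all* methods are unrecoverable by an oracle
# switcher that always picks a succeeding method, so the oracle's
# error-rate lower bound is |pal ∩ p2c ∩ cot| / N.
common = pal_wrongs & p2c_wrongs & cot_wrongs
print(f"{len(common) / n_questions:.2%}")  # 10.00%
```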
pal2cot.cot2p2c.pal2p2c

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
| 12 | rims                 |        0.611 | 11/18   |      0 | pal2cot.cot2p2c.pal2p2c |
| 13 | rims +indiv_eval     |        0.5   | 9/18    |      0 | pal2cot.cot2p2c.pal2p2c |
| 28 | ablation             |        0.667 | 12/18   |      0 | pal2cot.cot2p2c.pal2p2c |
| 29 | ablation +indiv_eval |        0.5   | 9/18    |      0 | pal2cot.cot2p2c.pal2p2c |
p2c2pal.cot2p2c.cot2pal (best, 12/18; 3 above the ablation)

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
| 10 | rims                 |        0.667 | 12/18   |      0 | p2c2pal.cot2p2c.cot2pal |
| 11 | rims +indiv_eval     |        0.556 | 10/18   |      0 | p2c2pal.cot2p2c.cot2pal |
| 26 | ablation             |        0.5   | 9/18    |      0 | p2c2pal.cot2p2c.cot2pal |
| 27 | ablation +indiv_eval |        0.5   | 9/18    |      0 | p2c2pal.cot2p2c.cot2pal |
p2c2cot.cot2pal.pal2p2c

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
|  6 | rims                 |        0.611 | 11/18   |      0 | p2c2cot.cot2pal.pal2p2c |
|  7 | rims +indiv_eval     |        0.389 | 7/18    |      0 | p2c2cot.cot2pal.pal2p2c |
| 22 | ablation             |        0.667 | 12/18   |      0 | p2c2cot.cot2pal.pal2p2c |
| 23 | ablation +indiv_eval |        0.5   | 9/18    |      0 | p2c2cot.cot2pal.pal2p2c |
pal2p2c.cot2p2c.cot2pal

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
| 16 | rims                 |        0.556 | 10/18   |      0 | pal2p2c.cot2p2c.cot2pal |
| 17 | rims +indiv_eval     |        0.444 | 8/18    |      0 | pal2p2c.cot2p2c.cot2pal |
| 32 | ablation             |        0.444 | 8/18    |      0 | pal2p2c.cot2p2c.cot2pal |
| 33 | ablation +indiv_eval |        0.389 | 7/18    |      0 | pal2p2c.cot2p2c.cot2pal |
p2c2cot.pal2p2c.pal2cot (near-best, 11/18; 4/18 above the ablation)

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
|  8 | rims                 |        0.611 | 11/18   |      0 | p2c2cot.pal2p2c.pal2cot |
|  9 | rims +indiv_eval     |        0.389 | 7/18    |      0 | p2c2cot.pal2p2c.pal2cot |
| 24 | ablation             |        0.389 | 7/18    |      0 | p2c2cot.pal2p2c.pal2cot |
| 25 | ablation +indiv_eval |        0.389 | 7/18    |      0 | p2c2cot.pal2p2c.pal2cot |
cot2p2c.pal2cot.pal2p2c ← the prompt we have been experimenting with in this repo

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
|  0 | rims                 |        0.5   | 9/18    |      0 | cot2p2c.pal2cot.pal2p2c |
|  1 | rims +indiv_eval     |        0.667 | 12/18   |      0 | cot2p2c.pal2cot.pal2p2c |
| 18 | ablation             |        0.444 | 8/18    |      0 | cot2p2c.pal2cot.pal2p2c |
| 19 | ablation +indiv_eval |        0.5   | 9/18    |      0 | cot2p2c.pal2cot.pal2p2c |
cot2pal.cot2p2c.pal2p2c

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
|  2 | rims                 |        0.611 | 11/18   |      0 | cot2pal.cot2p2c.pal2p2c |
|  3 | rims +indiv_eval     |        0.5   | 9/18    |      0 | cot2pal.cot2p2c.pal2p2c |
| 20 | ablation             |        0.5   | 9/18    |      0 | cot2pal.cot2p2c.pal2p2c |
| 21 | ablation +indiv_eval |        0.444 | 8/18    |      0 | cot2pal.cot2p2c.pal2p2c |
pal2cot.p2c2cot.p2c2pal

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
| 14 | rims                 |        0.5   | 9/18    |      0 | pal2cot.p2c2cot.p2c2pal |
| 15 | rims +indiv_eval     |        0.556 | 10/18   |      0 | pal2cot.p2c2cot.p2c2pal |
| 30 | ablation             |        0.389 | 7/18    |      0 | pal2cot.p2c2cot.p2c2pal |
| 31 | ablation +indiv_eval |        0.5   | 9/18    |      0 | pal2cot.p2c2cot.p2c2pal |
cot2pal.p2c2cot.p2c2pal
| | exp | success_rate | numbers | failed | prompt |
|---:|:-----------------|---------------:|:----------|---------:|:------------------------|
| 4 | rims | 0.611 | 11/18 | 0 | cot2pal.p2c2cot.p2c2pal |
| 5 | rims +indiv_eval | 0.5 | 9/18 | 0 | cot2pal.p2c2cot.p2c2pal |
gpt4_3method_conflicts_model_selection_baseline

|    | exp      | success_rate | numbers | failed | prompt                                          |
|---:|:---------|-------------:|:--------|-------:|:------------------------------------------------|
| 34 | baseline |        0.167 | 3/18    |      0 | gpt4_3method_conflicts_model_selection_baseline |
Are there cases where the RIMS prompt detects that the original method could not solve a question, selects the right solving method, and gets it correct? (Chul)
All experiments so far use a prompt built from hard examples drawn from the gsm8k train set, and that one prompt is applied uniformly to every dataset.
Other open LLMs?