fgenie opened 5 months ago
(These experiments were run with the crude pre-November interpreter, so the numbers are less reliable. Still, if we later want to measure how complementary the single-method runs are, we can redo the same arithmetic.)
While preparing the reflexion-kshot-harvesting prompt, I report some numbers we can expect based on the previous ablations.
Our kshot-harvesting agent needs a prompt that invokes a decision among the following methods. For a better solution, the agent should...
To prepare the prompt, I explored ablations (single-method results) on gsm8k, to see realistic cases for method switching.
(Conditions: ChatGPT, greedy sampling, standard few-shot prompts for gsm8k.)
With greedy decoding, how many of the 1319 questions does each method get wrong?
pal: 268, p2c: 381, cot: 279
Do they fail on the same questions? No:
`len(pal_wrongs.intersection(p2c_wrongs))=192`
`len(pal_wrongs.intersection(cot_wrongs))=147`
`len(p2c_wrongs.intersection(cot_wrongs))=181`
What is the lower bound on the error rate if we use the three methods above optimally? 9.02% (=119/1319), since only the questions all three get wrong are unrecoverable:
`len(pal_wrongs.intersection(p2c_wrongs).intersection(cot_wrongs))=119`
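The lower-bound arithmetic above is just a triple set intersection. A minimal sketch with hypothetical wrong-answer index sets (the real sets come from scoring the 1319 gsm8k answers):

```python
# Toy sketch of the complementarity check (hypothetical index sets, not
# the real gsm8k results): each *_wrongs holds the indices of questions
# a method answered incorrectly.
pal_wrongs = {1, 2, 3, 5}
p2c_wrongs = {2, 3, 4, 6}
cot_wrongs = {3, 5, 6, 7}
n_questions = 10

# Pairwise overlaps: the methods fail on partly different questions.
print(len(pal_wrongs & p2c_wrongs))        # 2

# Only questions missed by *all* methods are unrecoverable by an oracle
# switcher that always picks a succeeding method, so the oracle's
# error-rate lower bound is |pal ∩ p2c ∩ cot| / N.
common = pal_wrongs & p2c_wrongs & cot_wrongs
print(f"{len(common) / n_questions:.2%}")  # 10.00%
```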
pal2cot.cot2p2c.pal2p2c

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
| 12 | rims                 |        0.611 | 11/18   |      0 | pal2cot.cot2p2c.pal2p2c |
| 13 | rims +indiv_eval     |        0.5   | 9/18    |      0 | pal2cot.cot2p2c.pal2p2c |
| 28 | ablation             |        0.667 | 12/18   |      0 | pal2cot.cot2p2c.pal2p2c |
| 29 | ablation +indiv_eval |        0.5   | 9/18    |      0 | pal2cot.cot2p2c.pal2p2c |
p2c2pal.cot2p2c.cot2pal (best, 12/18; 3 above the ablation)

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
| 10 | rims                 |        0.667 | 12/18   |      0 | p2c2pal.cot2p2c.cot2pal |
| 11 | rims +indiv_eval     |        0.556 | 10/18   |      0 | p2c2pal.cot2p2c.cot2pal |
| 26 | ablation             |        0.5   | 9/18    |      0 | p2c2pal.cot2p2c.cot2pal |
| 27 | ablation +indiv_eval |        0.5   | 9/18    |      0 | p2c2pal.cot2p2c.cot2pal |
p2c2cot.cot2pal.pal2p2c

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
|  6 | rims                 |        0.611 | 11/18   |      0 | p2c2cot.cot2pal.pal2p2c |
|  7 | rims +indiv_eval     |        0.389 | 7/18    |      0 | p2c2cot.cot2pal.pal2p2c |
| 22 | ablation             |        0.667 | 12/18   |      0 | p2c2cot.cot2pal.pal2p2c |
| 23 | ablation +indiv_eval |        0.5   | 9/18    |      0 | p2c2cot.cot2pal.pal2p2c |
pal2p2c.cot2p2c.cot2pal

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
| 16 | rims                 |        0.556 | 10/18   |      0 | pal2p2c.cot2p2c.cot2pal |
| 17 | rims +indiv_eval     |        0.444 | 8/18    |      0 | pal2p2c.cot2p2c.cot2pal |
| 32 | ablation             |        0.444 | 8/18    |      0 | pal2p2c.cot2p2c.cot2pal |
| 33 | ablation +indiv_eval |        0.389 | 7/18    |      0 | pal2p2c.cot2p2c.cot2pal |
p2c2cot.pal2p2c.pal2cot (near-best, 11/18; 4/18 above the ablation)

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
|  8 | rims                 |        0.611 | 11/18   |      0 | p2c2cot.pal2p2c.pal2cot |
|  9 | rims +indiv_eval     |        0.389 | 7/18    |      0 | p2c2cot.pal2p2c.pal2cot |
| 24 | ablation             |        0.389 | 7/18    |      0 | p2c2cot.pal2p2c.pal2cot |
| 25 | ablation +indiv_eval |        0.389 | 7/18    |      0 | p2c2cot.pal2p2c.pal2cot |
cot2p2c.pal2cot.pal2p2c ← the prompt we have been experimenting with in this repo

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
|  0 | rims                 |        0.5   | 9/18    |      0 | cot2p2c.pal2cot.pal2p2c |
|  1 | rims +indiv_eval     |        0.667 | 12/18   |      0 | cot2p2c.pal2cot.pal2p2c |
| 18 | ablation             |        0.444 | 8/18    |      0 | cot2p2c.pal2cot.pal2p2c |
| 19 | ablation +indiv_eval |        0.5   | 9/18    |      0 | cot2p2c.pal2cot.pal2p2c |
cot2pal.cot2p2c.pal2p2c

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
|  2 | rims                 |        0.611 | 11/18   |      0 | cot2pal.cot2p2c.pal2p2c |
|  3 | rims +indiv_eval     |        0.5   | 9/18    |      0 | cot2pal.cot2p2c.pal2p2c |
| 20 | ablation             |        0.5   | 9/18    |      0 | cot2pal.cot2p2c.pal2p2c |
| 21 | ablation +indiv_eval |        0.444 | 8/18    |      0 | cot2pal.cot2p2c.pal2p2c |
pal2cot.p2c2cot.p2c2pal

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
| 14 | rims                 |        0.5   | 9/18    |      0 | pal2cot.p2c2cot.p2c2pal |
| 15 | rims +indiv_eval     |        0.556 | 10/18   |      0 | pal2cot.p2c2cot.p2c2pal |
| 30 | ablation             |        0.389 | 7/18    |      0 | pal2cot.p2c2cot.p2c2pal |
| 31 | ablation +indiv_eval |        0.5   | 9/18    |      0 | pal2cot.p2c2cot.p2c2pal |
cot2pal.p2c2cot.p2c2pal
| | exp | success_rate | numbers | failed | prompt |
|---:|:-----------------|---------------:|:----------|---------:|:------------------------|
| 4 | rims | 0.611 | 11/18 | 0 | cot2pal.p2c2cot.p2c2pal |
| 5 | rims +indiv_eval | 0.5 | 9/18 | 0 | cot2pal.p2c2cot.p2c2pal |
gpt4_3method_conflicts_model_selection_baseline

|    | exp      | success_rate | numbers | failed | prompt                                          |
|---:|:---------|-------------:|:--------|-------:|:------------------------------------------------|
| 34 | baseline |        0.167 | 3/18    |      0 | gpt4_3method_conflicts_model_selection_baseline |
Are there cases where the RIMS prompt detects that the original method could not solve a question, selects the right solving method, and gets it correct? (Chul)
All experiments so far use a prompt built from hard examples drawn from the gsm8k train set, and that one prompt is applied uniformly to every dataset.
Other open LLMs?