fgenie / rims_minimal


Points to discuss #27

Open fgenie opened 5 months ago

fgenie commented 5 months ago

Are there cases where the RIMS prompt detects that the original method could not solve a problem, selects the correct solution method instead, and gets it right? (Chul)

All experiments so far use a prompt built from hard examples drawn from the GSM8K train set, and that same prompt is applied uniformly to every dataset.

Other open LLMs?

fgenie commented 5 months ago

(Possibly useful past material 1) Numbers from the ablation

(These experiments were run with the crude pre-November interpreter, so the numbers are less reliable. Still, if we want to gauge how complementary the only-one-method runs are, we can redo a similar back-of-the-envelope calculation.)

While preparing the reflexion-kshot-harvesting prompt, I report some numbers we can expect based on the previous ablations.

Our kshot-harvesting agent needs a prompt that elicits a decision among the following. For a better solution, the agent should...

To prepare the prompt, I explored the ablation (single-model results) on GSM8K, to see realistic cases for model switching.

(Conditions: ChatGPT, greedy decoding, standard few-shot prompts for GSM8K.)

With greedy decoding, how many of the 1319 questions does each model get wrong?

- pal: 268
- p2c: 381
- cot: 279

Do they fail on the same questions? No:

```
len(pal_wrongs.intersection(p2c_wrongs))=192
len(pal_wrongs.intersection(cot_wrongs))=147
len(p2c_wrongs.intersection(cot_wrongs))=181
```

What is the lower bound on the error rate if the three models above were selected optimally? 9.02% (= 119/1319):

```
len(pal_wrongs.intersection(p2c_wrongs).intersection(cot_wrongs))=119
```
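
As a side note, the counts above reduce to simple set arithmetic over per-model wrong-question index sets. Below is a minimal, self-contained sketch of that computation; the `gold`/`preds` toy data and all variable names are illustrative stand-ins, not code from this repo.

```python
from itertools import combinations

# Toy stand-ins: `gold` holds reference answers and `preds` each
# model's greedy predictions (in the report, 1319 GSM8K questions).
N = 10
gold = [i % 7 for i in range(N)]
preds = {
    "pal": [g if i % 2 else -1 for i, g in enumerate(gold)],
    "p2c": [g if i % 3 else -1 for i, g in enumerate(gold)],
    "cot": [g if i % 5 else -1 for i, g in enumerate(gold)],
}

# Index set of questions each model gets wrong.
wrongs = {
    m: {i for i, (p, g) in enumerate(zip(pred, gold)) if p != g}
    for m, pred in preds.items()
}
for m, w in wrongs.items():
    print(f"{m}: {len(w)} wrong")  # report: pal 268, p2c 381, cot 279

# Pairwise overlaps: do the models fail on the same questions?
for a, b in combinations(wrongs, 2):
    print(a, b, len(wrongs[a] & wrongs[b]))

# Oracle lower bound: a question is lost only if all three models fail on it.
unsolvable = set.intersection(*wrongs.values())
print(f"oracle error rate: {len(unsolvable) / N:.2%}")  # report: 119/1319 = 9.02%
```

With the real prediction files plugged in, the same three prints should reproduce 268/381/279, the pairwise overlaps, and the 9.02% oracle lower bound.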

fgenie commented 5 months ago

(Possibly useful past material 2) The RIMS prompt can be composed in 48 ways; per-prompt performance/ablation for 9 of them: gpt-4-0613 + the crude interpreter

p2c2pal.cot2p2c.cot2pal (best: 12/18; 3 above ablation)

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
| 10 | rims                 |        0.667 | 12/18   |      0 | p2c2pal.cot2p2c.cot2pal |
| 11 | rims +indiv_eval     |        0.556 | 10/18   |      0 | p2c2pal.cot2p2c.cot2pal |
| 26 | ablation             |        0.5   | 9/18    |      0 | p2c2pal.cot2p2c.cot2pal |
| 27 | ablation +indiv_eval |        0.5   | 9/18    |      0 | p2c2pal.cot2p2c.cot2pal |

p2c2cot.cot2pal.pal2p2c

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
|  6 | rims                 |        0.611 | 11/18   |      0 | p2c2cot.cot2pal.pal2p2c |
|  7 | rims +indiv_eval     |        0.389 | 7/18    |      0 | p2c2cot.cot2pal.pal2p2c |
| 22 | ablation             |        0.667 | 12/18   |      0 | p2c2cot.cot2pal.pal2p2c |
| 23 | ablation +indiv_eval |        0.5   | 9/18    |      0 | p2c2cot.cot2pal.pal2p2c |

pal2p2c.cot2p2c.cot2pal

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
| 16 | rims                 |        0.556 | 10/18   |      0 | pal2p2c.cot2p2c.cot2pal |
| 17 | rims +indiv_eval     |        0.444 | 8/18    |      0 | pal2p2c.cot2p2c.cot2pal |
| 32 | ablation             |        0.444 | 8/18    |      0 | pal2p2c.cot2p2c.cot2pal |
| 33 | ablation +indiv_eval |        0.389 | 7/18    |      0 | pal2p2c.cot2p2c.cot2pal |

p2c2cot.pal2p2c.pal2cot (near-best: 11/18; 4/18 above ablation)

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
|  8 | rims                 |        0.611 | 11/18   |      0 | p2c2cot.pal2p2c.pal2cot |
|  9 | rims +indiv_eval     |        0.389 | 7/18    |      0 | p2c2cot.pal2p2c.pal2cot |
| 24 | ablation             |        0.389 | 7/18    |      0 | p2c2cot.pal2p2c.pal2cot |
| 25 | ablation +indiv_eval |        0.389 | 7/18    |      0 | p2c2cot.pal2p2c.pal2cot |

cot2p2c.pal2cot.pal2p2c ← the prompt we experimented with in this repo

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
|  0 | rims                 |        0.5   | 9/18    |      0 | cot2p2c.pal2cot.pal2p2c |
|  1 | rims +indiv_eval     |        0.667 | 12/18   |      0 | cot2p2c.pal2cot.pal2p2c |
| 18 | ablation             |        0.444 | 8/18    |      0 | cot2p2c.pal2cot.pal2p2c |
| 19 | ablation +indiv_eval |        0.5   | 9/18    |      0 | cot2p2c.pal2cot.pal2p2c |

cot2pal.cot2p2c.pal2p2c

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
|  2 | rims                 |        0.611 | 11/18   |      0 | cot2pal.cot2p2c.pal2p2c |
|  3 | rims +indiv_eval     |        0.5   | 9/18    |      0 | cot2pal.cot2p2c.pal2p2c |
| 20 | ablation             |        0.5   | 9/18    |      0 | cot2pal.cot2p2c.pal2p2c |
| 21 | ablation +indiv_eval |        0.444 | 8/18    |      0 | cot2pal.cot2p2c.pal2p2c |

pal2cot.p2c2cot.p2c2pal

|    | exp                  | success_rate | numbers | failed | prompt                  |
|---:|:---------------------|-------------:|:--------|-------:|:------------------------|
| 14 | rims                 |        0.5   | 9/18    |      0 | pal2cot.p2c2cot.p2c2pal |
| 15 | rims +indiv_eval     |        0.556 | 10/18   |      0 | pal2cot.p2c2cot.p2c2pal |
| 30 | ablation             |        0.389 | 7/18    |      0 | pal2cot.p2c2cot.p2c2pal |
| 31 | ablation +indiv_eval |        0.5   | 9/18    |      0 | pal2cot.p2c2cot.p2c2pal |

cot2pal.p2c2cot.p2c2pal

|    | exp              | success_rate | numbers | failed | prompt                  |
|---:|:-----------------|-------------:|:--------|-------:|:------------------------|
|  4 | rims             |        0.611 | 11/18   |      0 | cot2pal.p2c2cot.p2c2pal |
|  5 | rims +indiv_eval |        0.5   | 9/18    |      0 | cot2pal.p2c2cot.p2c2pal |

gpt4_3method_conflicts_model_selection_baseline

|    | exp      | success_rate | numbers | failed | prompt                                          |
|---:|:---------|-------------:|:--------|-------:|:------------------------------------------------|
| 34 | baseline |        0.167 | 3/18    |      0 | gpt4_3method_conflicts_model_selection_baseline |
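
For convenience, the per-permutation annotations above (e.g. best: 12/18, 3 above ablation) can be recomputed directly from the tables. The sketch below hard-codes the rims / plain-ablation correct counts copied from the tables; it is only an illustration of the arithmetic, not code from this repo.

```python
# Correct answers out of 18 conflict questions, copied from the tables
# above (rims vs. plain ablation, without the +indiv_eval variants).
N = 18
results = {  # prompt permutation -> (rims correct, ablation correct)
    "p2c2pal.cot2p2c.cot2pal": (12, 9),
    "p2c2cot.cot2pal.pal2p2c": (11, 12),
    "pal2p2c.cot2p2c.cot2pal": (10, 8),
    "p2c2cot.pal2p2c.pal2cot": (11, 7),
    "cot2p2c.pal2cot.pal2p2c": (9, 8),
    "cot2pal.cot2p2c.pal2p2c": (11, 9),
    "pal2cot.p2c2cot.p2c2pal": (9, 7),
    "cot2pal.p2c2cot.p2c2pal": (11, None),  # no ablation row reported
}

# Rank permutations by rims success and show the delta over ablation.
for prompt, (rims, abl) in sorted(
    results.items(), key=lambda kv: kv[1][0], reverse=True
):
    delta = f"{rims - abl:+d}" if abl is not None else "n/a"
    print(f"{prompt}: rims {rims}/{N} (vs ablation: {delta})")
```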