Closed — StellaAthena closed this issue 1 year ago
@PatrykNeubauer any progress?
Hey, sorry but nothing concrete yet, as I've been mostly getting up to speed on the library and the evaluation of LLMs in general.

What I've found:

- the "mnli hypothesis: {{hypothesis}} premise: {{premise}}" format, with results in table 16.
- "Does {{premise}} mean that {{hypothesis}}?", later changed to "Premise: {{premise}}\nHypothesis: {{hypothesis}}\nDoes the premise entail the hypothesis?" (p. 5 and 30).

Two possible sources of inconsistencies I've noticed:

- "The loophole is now gone True, False, or Neither?"
- whether "Answer:" is added to the prompt, with it even being missing in the GPT-3 paper.

So: looking at other tasks, the current "{}\nQuestion: {} True, False or Neither?\nAnswer:" seems to be the best option.

I ran the evaluation on a few models, trying both the current version of the prompt and a slightly modified one with an extra \n, to see how a minor change like that affects the results.

To summarize, with a single \n:
All runs: limit: None, provide_description: False, num_fewshot: 0, batch_size: None

| Model | Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|
| hf-causal (pretrained=gpt2) | mnli | 0 | acc | 0.3372 | ± | 0.0048 |
| gpt3 (engine=davinci) | mnli | 0 | acc | 0.3943 | ± | 0.0049 |
| gpt3 (engine=text-davinci-003) | mnli | 0 | acc | 0.6456 | ± | 0.0048 |
| hf-causal (pretrained=facebook/opt-125m) | mnli | 0 | acc | 0.3447 | ± | 0.0048 |
| hf-causal (pretrained=facebook/opt-350m) | mnli | 0 | acc | 0.3447 | ± | 0.0048 |
| hf-causal (pretrained=facebook/opt-1.3b) | mnli | 0 | acc | 0.3583 | ± | 0.0048 |
| hf-seq2seq (pretrained=t5-base) | mnli | 0 | acc | 0.5673 | ± | 0.005 |
| hf-seq2seq (pretrained=google/flan-t5-base) | mnli | 0 | acc | 0.6674 | ± | 0.0048 |
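For reproducibility, the run headers above suggest invocations along these lines. This is a hedged sketch of the harness's older `main.py` interface; the flag names are assumed from the settings printed in the headers and may differ by version:

```shell
# Hypothetical lm-evaluation-harness invocation (older main.py CLI);
# model/model_args/tasks/num_fewshot are taken from the run headers above.
python main.py \
  --model hf-causal \
  --model_args pretrained=gpt2 \
  --tasks mnli \
  --num_fewshot 0
```

Swapping `--model_args pretrained=gpt2` for e.g. `pretrained=facebook/opt-1.3b`, or `--model hf-causal` for `hf-seq2seq`, would presumably cover the other rows.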
With extra \n:
Same settings as above (limit: None, provide_description: False, num_fewshot: 0, batch_size: None)

| Model | Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|
| hf-causal (pretrained=gpt2) | mnli | 0 | acc | 0.3376 | ± | 0.0048 |
| gpt3 (engine=text-davinci-003) | mnli | 0 | acc | 0.6422 | ± | 0.0048 |
| hf-causal (pretrained=facebook/opt-125m) | mnli | 0 | acc | 0.3536 | ± | 0.0048 |
| hf-causal (pretrained=facebook/opt-350m) | mnli | 0 | acc | 0.3452 | ± | 0.0048 |
| hf-causal (pretrained=facebook/opt-1.3b) | mnli | 0 | acc | 0.3583 | ± | 0.0048 |
| hf-seq2seq (pretrained=t5-base) | mnli | 0 | acc | 0.5673 | ± | 0.005 |
| hf-seq2seq (pretrained=google/flan-t5-base) | mnli | 0 | acc | 0.6674 | ± | 0.0048 |
(Perhaps worth noting that opt-125m scored higher than opt-350m with this format.)
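For concreteness, the two prompt variants compared above can be sketched in plain Python. This assumes the extra \n is inserted before "Answer:" (the exact position of the extra newline isn't pinned down above), and the premise/hypothesis pair is a made-up illustration:

```python
# Sketch of the two MNLI prompt variants compared above:
# "{premise}\nQuestion: {hypothesis} True, False or Neither?\nAnswer:"
# with an optional extra newline (assumed here to go before "Answer:").

def mnli_prompt(premise: str, hypothesis: str, extra_newline: bool = False) -> str:
    sep = "\n\n" if extra_newline else "\n"
    return f"{premise}\nQuestion: {hypothesis} True, False or Neither?{sep}Answer:"

# Made-up example pair for illustration.
premise = "The loophole is now gone."
hypothesis = "The loophole still exists."

print(mnli_prompt(premise, hypothesis))                       # single \n variant
print(mnli_prompt(premise, hypothesis, extra_newline=True))   # extra \n variant
```

Under zero-shot evaluation the model would then be scored on the likelihood of " True", " False", and " Neither" continuations after "Answer:".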
This is an excellent report! I feel comfortable adopting this as our Officially Recommended Format now :)
Hi, I'll be working on this one!