EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Validate MNLI #450

StellaAthena closed this issue 1 year ago

PatrykNeubauer commented 1 year ago

Hi, I'll be working on this one!

StellaAthena commented 1 year ago

@PatrykNeubauer any progress?

PatrykNeubauer commented 1 year ago

Hey, sorry, nothing concrete yet; I've been mostly getting up to speed on the library and on the evaluation of LLMs in general.

What I've found:

Two possible sources of inconsistencies I've noticed:

PatrykNeubauer commented 1 year ago

I ran the evaluation on a few models, trying both the current version of the prompt and a slightly modified one with an extra \n, to see how a minor change like that affects the results.
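For reference, here is a minimal sketch of the two prompt variants being compared. The single-\n template mirrors the harness's zero-shot MNLI format as I understand it; where exactly the extra \n was inserted is my assumption, so treat this as illustrative rather than the committed code:

```python
# Sketch of the two zero-shot MNLI prompt variants compared below.
# The single-\n template follows the harness's MNLI doc_to_text as I
# understand it; the extra-\n placement is an assumption about where
# the additional newline goes, not the committed code.

def mnli_prompt(premise: str, hypothesis: str, extra_newline: bool = False) -> str:
    sep = "\n\n" if extra_newline else "\n"
    hypothesis = hypothesis.strip()
    if not hypothesis.endswith("."):
        hypothesis += "."
    return f"{premise}{sep}Question: {hypothesis} True, False or Neither?\nAnswer:"

print(mnli_prompt("A soccer game with multiple males playing.",
                  "Some men are playing a sport"))
# A soccer game with multiple males playing.
# Question: Some men are playing a sport. True, False or Neither?
# Answer:
```

The exact invocation isn't given in the thread, but judging by the config headers in the results, the runs were presumably launched with something like `python main.py --model hf-causal --model_args pretrained=gpt2 --tasks mnli --num_fewshot 0`.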

To summarize (all runs: limit: None, provide_description: False, num_fewshot: 0, batch_size: None):

With a single \n:

GPT

| Model | Task | Version | Metric | Value | Stderr |
|-------|------|--------:|--------|------:|-------:|
| hf-causal (pretrained=gpt2) | mnli | 0 | acc | 0.3372 | ± 0.0048 |
| gpt3 (engine=davinci) | mnli | 0 | acc | 0.3943 | ± 0.0049 |
| gpt3 (engine=text-davinci-003) | mnli | 0 | acc | 0.6456 | ± 0.0048 |

OPT

| Model | Task | Version | Metric | Value | Stderr |
|-------|------|--------:|--------|------:|-------:|
| hf-causal (pretrained=facebook/opt-125m) | mnli | 0 | acc | 0.3447 | ± 0.0048 |
| hf-causal (pretrained=facebook/opt-350m) | mnli | 0 | acc | 0.3447 | ± 0.0048 |
| hf-causal (pretrained=facebook/opt-1.3b) | mnli | 0 | acc | 0.3583 | ± 0.0048 |

T5

| Model | Task | Version | Metric | Value | Stderr |
|-------|------|--------:|--------|------:|-------:|
| hf-seq2seq (pretrained=t5-base) | mnli | 0 | acc | 0.5673 | ± 0.0050 |
| hf-seq2seq (pretrained=google/flan-t5-base) | mnli | 0 | acc | 0.6674 | ± 0.0048 |

With an extra \n:

GPT

| Model | Task | Version | Metric | Value | Stderr |
|-------|------|--------:|--------|------:|-------:|
| hf-causal (pretrained=gpt2) | mnli | 0 | acc | 0.3376 | ± 0.0048 |
| gpt3 (engine=text-davinci-003) | mnli | 0 | acc | 0.6422 | ± 0.0048 |

OPT

| Model | Task | Version | Metric | Value | Stderr |
|-------|------|--------:|--------|------:|-------:|
| hf-causal (pretrained=facebook/opt-125m) | mnli | 0 | acc | 0.3536 | ± 0.0048 |
| hf-causal (pretrained=facebook/opt-350m) | mnli | 0 | acc | 0.3452 | ± 0.0048 |
| hf-causal (pretrained=facebook/opt-1.3b) | mnli | 0 | acc | 0.3583 | ± 0.0048 |

T5

| Model | Task | Version | Metric | Value | Stderr |
|-------|------|--------:|--------|------:|-------:|
| hf-seq2seq (pretrained=t5-base) | mnli | 0 | acc | 0.5673 | ± 0.0050 |
| hf-seq2seq (pretrained=google/flan-t5-base) | mnli | 0 | acc | 0.6674 | ± 0.0048 |

(Perhaps worth noting that opt-125m scored better than opt-350m with this format.)

StellaAthena commented 1 year ago

This is an excellent report! I feel comfortable adopting this as our Officially Recommended Format now :)