EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

Validate MNLI #450

StellaAthena closed this issue 1 year ago

PatrykNeubauer commented 1 year ago

Hi, I'll be working on this one!

StellaAthena commented 1 year ago

@PatrykNeubauer any progress?

PatrykNeubauer commented 1 year ago

Hey, sorry, nothing concrete yet; I've been mostly getting up to speed on the library and on the evaluation of LLMs in general.

What I've found:

Two possible sources of inconsistencies I've noticed:

PatrykNeubauer commented 1 year ago

I ran the evaluation on a few models, trying both the current version of the prompt and a slightly modified one with an extra \n, to see how a minor change like that affects the results.
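For reference, here is a minimal sketch of the two prompt variants being compared. The single-\n template mirrors the harness's zero-shot MNLI format as I understand it; where exactly the extra \n was inserted is my assumption, so treat this as illustrative rather than the committed code:

```python
# Sketch of the two zero-shot MNLI prompt variants compared below.
# The single-\n template follows the harness's MNLI doc_to_text as I
# understand it; the extra-\n placement is an assumption about where
# the additional newline goes, not the committed code.

def mnli_prompt(premise: str, hypothesis: str, extra_newline: bool = False) -> str:
    sep = "\n\n" if extra_newline else "\n"
    hypothesis = hypothesis.strip()
    if not hypothesis.endswith("."):
        hypothesis += "."
    return f"{premise}{sep}Question: {hypothesis} True, False or Neither?\nAnswer:"

print(mnli_prompt("A soccer game with multiple males playing.",
                  "Some men are playing a sport"))
# A soccer game with multiple males playing.
# Question: Some men are playing a sport. True, False or Neither?
# Answer:
```

The exact invocation isn't given in the thread, but judging by the config headers in the results, the runs were presumably launched with something like `python main.py --model hf-causal --model_args pretrained=gpt2 --tasks mnli --num_fewshot 0`.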

To summarize (all runs: limit: None, provide_description: False, num_fewshot: 0, batch_size: None):

With a single \n:

GPT

| Model | Task | Version | Metric | Value | Stderr |
|-------|------|--------:|--------|------:|-------:|
| hf-causal (pretrained=gpt2) | mnli | 0 | acc | 0.3372 | ± 0.0048 |
| gpt3 (engine=davinci) | mnli | 0 | acc | 0.3943 | ± 0.0049 |
| gpt3 (engine=text-davinci-003) | mnli | 0 | acc | 0.6456 | ± 0.0048 |

OPT

| Model | Task | Version | Metric | Value | Stderr |
|-------|------|--------:|--------|------:|-------:|
| hf-causal (pretrained=facebook/opt-125m) | mnli | 0 | acc | 0.3447 | ± 0.0048 |
| hf-causal (pretrained=facebook/opt-350m) | mnli | 0 | acc | 0.3447 | ± 0.0048 |
| hf-causal (pretrained=facebook/opt-1.3b) | mnli | 0 | acc | 0.3583 | ± 0.0048 |

T5

| Model | Task | Version | Metric | Value | Stderr |
|-------|------|--------:|--------|------:|-------:|
| hf-seq2seq (pretrained=t5-base) | mnli | 0 | acc | 0.5673 | ± 0.0050 |
| hf-seq2seq (pretrained=google/flan-t5-base) | mnli | 0 | acc | 0.6674 | ± 0.0048 |

With an extra \n:

GPT

| Model | Task | Version | Metric | Value | Stderr |
|-------|------|--------:|--------|------:|-------:|
| hf-causal (pretrained=gpt2) | mnli | 0 | acc | 0.3376 | ± 0.0048 |
| gpt3 (engine=text-davinci-003) | mnli | 0 | acc | 0.6422 | ± 0.0048 |

OPT

| Model | Task | Version | Metric | Value | Stderr |
|-------|------|--------:|--------|------:|-------:|
| hf-causal (pretrained=facebook/opt-125m) | mnli | 0 | acc | 0.3536 | ± 0.0048 |
| hf-causal (pretrained=facebook/opt-350m) | mnli | 0 | acc | 0.3452 | ± 0.0048 |
| hf-causal (pretrained=facebook/opt-1.3b) | mnli | 0 | acc | 0.3583 | ± 0.0048 |

T5

| Model | Task | Version | Metric | Value | Stderr |
|-------|------|--------:|--------|------:|-------:|
| hf-seq2seq (pretrained=t5-base) | mnli | 0 | acc | 0.5673 | ± 0.0050 |
| hf-seq2seq (pretrained=google/flan-t5-base) | mnli | 0 | acc | 0.6674 | ± 0.0048 |

(Perhaps worth noting that opt-125m scored better than opt-350m with this format.)

StellaAthena commented 1 year ago

This is an excellent report! I feel comfortable adopting this as our Officially Recommended Format now :)