bigscience-workshop / lm-evaluation-harness

A framework for few-shot evaluation of autoregressive language models.
MIT License

How is this evaluation done? #142

Closed: a-cavalcanti closed this issue 2 years ago

a-cavalcanti commented 2 years ago

I'm a bit confused about the evaluation. I'm trying to understand the outputs of the WNLI task evaluation, because the model's answer is a pair of a log-probability and a boolean value:

Answer: (log prob, is-exact-match)

```python
answer = (float(logits.sum()), bool(max_equal))
```

Here we have the sum of the logits. I thought the model would output exactly the word expected by the prompt. Could someone clarify how the evaluation is done?

If I use the generate method instead, the result is very different:

```python
generated_text_samples = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    do_sample=False,
    max_new_tokens=1,
)
```
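Digging into the code a bit, that tuple seems to come from logic roughly like the following (a rough sketch, not the harness's actual implementation; the model, helper name, and variable names are only illustrative):

```python
# Rough sketch of how a (log prob, is-exact-match) pair can be computed for a
# context/continuation pair with a Hugging Face causal LM. Illustrative only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def loglikelihood(context: str, continuation: str):
    # Note: for GPT-2-style tokenizers the continuation usually starts with a space.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, cont_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits  # [1, seq_len, vocab]

    # Log-probabilities at the positions that predict the continuation tokens.
    logprobs = F.log_softmax(logits, dim=-1)[:, ctx_ids.shape[1] - 1 : -1, :]

    # Sum of the log-probabilities of the continuation tokens ("log prob").
    cont_logprob = logprobs.gather(2, cont_ids.unsqueeze(-1)).squeeze(-1).sum()

    # Would greedy decoding have produced exactly these tokens? ("is-exact-match")
    greedy_tokens = logprobs.argmax(dim=-1)
    exact_match = (greedy_tokens == cont_ids).all()

    return float(cont_logprob), bool(exact_match)
```

If that reading is right, it also explains why generate looks so different: with max_new_tokens=1 you only get the single most likely next token, whereas the tuple above scores an entire candidate continuation.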

StellaAthena commented 2 years ago

The standard way to evaluate generative language models on tasks like this is to frame them as multiple-choice problems. This is important because there are many equivalent ways a language model might express the same concept, and it's not at all clear how to map that diversity of responses to meaningful labels. What the code does is check the logprobs of the candidate continuations and return the answer (out of a pre-defined "word bank") that has the highest probability.
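Concretely, that multiple-choice scoring looks something like the following (a minimal sketch building on the loglikelihood() helper above; the prompt template and word bank are illustrative, not the exact WNLI template the harness uses):

```python
# Score each candidate continuation from a pre-defined word bank and pick the
# most probable one. Prompt and candidates below are illustrative only.
prompt = (
    "The trophy doesn't fit into the brown suitcase because it is too large.\n"
    "Question: The trophy is too large. True or False?\n"
    "Answer:"
)
word_bank = [" True", " False"]  # pre-defined candidate answers

scores = {cand: loglikelihood(prompt, cand)[0] for cand in word_bank}
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```

The predicted label is whichever candidate gets the highest summed log-probability, so the model never has to generate the answer word verbatim.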