Closed: a-cavalcanti closed this issue 2 years ago
The standard way to evaluate generative language models is to treat the task as a multiple-choice problem. This matters because there are many equivalent ways a language model might express the same concept, and it's not at all clear how to map that diversity of responses to meaningful labels. What the code does is check the logprobs of the candidate continuations and return the answer (out of a pre-defined "word bank") that has the highest probability.
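Roughly, the scoring looks like this (a minimal sketch with a made-up prompt and word bank, not the actual harness code or the real WNLI template):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = ("The trophy doesn't fit into the suitcase because it is too small.\n"
          "Question: Is the suitcase too small? True or False?\nAnswer:")
word_bank = [" True", " False"]  # pre-defined candidate continuations

scores = []
for candidate in word_bank:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits          # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Log-prob of each continuation token, conditioned on everything before it.
    # The logits at position t predict token t+1, hence the shift by one.
    cont_len = full_ids.shape[1] - prompt_ids.shape[1]
    cont_ids = full_ids[0, -cont_len:]
    token_log_probs = log_probs[0, -cont_len - 1:-1].gather(1, cont_ids.unsqueeze(-1))
    scores.append(token_log_probs.sum().item())

# The predicted label is simply the candidate with the highest total log-prob.
prediction = word_bank[scores.index(max(scores))]
print(scores, prediction)
```

The model is never asked to emit the label as free text; it is only scored on how likely each pre-defined continuation is.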
I'm a bit confused about the evaluation. I'm trying to understand the outputs of the WNLI task evaluation, because each model answer is a pair of a probability and a boolean value:
Answer: (log prob, is-exact-match)
answer = (float(logits.sum()), bool(max_equal))
Here we have the sum of logits. I thought the model would output exactly the word expected by the prompt. Could someone clarify how the evaluation is done?
Because if I use the generate method instead, the result is very different:
```python
generated_text_samples = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    do_sample=False,
    max_new_tokens=1,
)
```
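For reference, this is how I decode the single generated token (with a decoder-only model, generate returns the prompt followed by the new token, so I slice the prompt off first):

```python
# Decode only the newly generated token, not the echoed prompt.
prompt_length = inputs['input_ids'].shape[1]
generated_text = tokenizer.decode(
    generated_text_samples[0, prompt_length:],
    skip_special_tokens=True,
)
print(generated_text)  # a free-form continuation, not necessarily one of the task's labels
```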