Closed: a-cavalcanti closed this issue 2 years ago
The standard way to evaluate generative language models is to treat the task as a multiple-choice problem. This matters because there are many equivalent ways a language model might express the same concept, and it's not at all clear how to map that diversity of responses to meaningful labels. What the code does is check the logprobs of the candidate continuations and return the answer (out of a pre-defined "word bank") that has the highest probability.
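Roughly, the scoring looks like this (a minimal sketch with a made-up prompt and word bank, not the actual harness code or the real WNLI template):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = ("The trophy doesn't fit into the suitcase because it is too small.\n"
          "Question: Is the suitcase too small? True or False?\nAnswer:")
word_bank = [" True", " False"]  # pre-defined candidate continuations

scores = []
for candidate in word_bank:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits          # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Log-prob of each continuation token, conditioned on everything before it.
    # The logits at position t predict token t+1, hence the shift by one.
    cont_len = full_ids.shape[1] - prompt_ids.shape[1]
    cont_ids = full_ids[0, -cont_len:]
    token_log_probs = log_probs[0, -cont_len - 1:-1].gather(1, cont_ids.unsqueeze(-1))
    scores.append(token_log_probs.sum().item())

# The predicted label is simply the candidate with the highest total log-prob.
prediction = word_bank[scores.index(max(scores))]
print(scores, prediction)
```

The model is never asked to emit the label as free text; it is only scored on how likely each pre-defined continuation is.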
I'm a bit confused about the evaluation. I'm trying to understand the outputs of the WNLI task evaluation, because each model answer is a pair of a probability and a boolean value:
Answer: (log prob, is-exact-match)
answer = (float(logits.sum()), bool(max_equal))
Here we have the sum of logits. I thought the model would output exactly the word expected by the prompt. Could someone clarify how the evaluation is done?
Because if I use the generate method instead, the result is very different:
```python
generated_text_samples = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    do_sample=False,
    max_new_tokens=1,
)
```
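For reference, this is how I decode the single generated token (with a decoder-only model, generate returns the prompt followed by the new token, so I slice the prompt off first):

```python
# Decode only the newly generated token, not the echoed prompt.
prompt_length = inputs['input_ids'].shape[1]
generated_text = tokenizer.decode(
    generated_text_samples[0, prompt_length:],
    skip_special_tokens=True,
)
print(generated_text)  # a free-form continuation, not necessarily one of the task's labels
```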