EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Clarification on Multiple Choice Question Handling in LM-Harness #2136

Open monk1337 opened 2 months ago

monk1337 commented 2 months ago

I am currently exploring the internal workings of the LM-Harness library and am considering contributing to its documentation to help clarify some of its mechanisms for future users.

I have a query regarding the handling of multiple-choice questions (MCQs) within the library. From what I gather, each MCQ is broken down into separate inputs corresponding to each choice, and each is then evaluated by the model. Could you please confirm if my understanding is correct, and elaborate on how exactly the library processes these inputs to select the most probable answer?

To illustrate my question with an example:

Context: "For a physics exam:"
Question: "What is the speed of light?"
Choices:

A. "300,000 km/s"
B. "150,000 km/s"
C. "299,792 km/s" (Correct)
D. "None of the above"

As I understand it, the library constructs an individual loglikelihood request for each choice, using task.py:972.
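
Something along these lines, perhaps (the prompt template and answer formatting below are my own illustration, not the harness's exact internals):

```python
# Hypothetical illustration: each answer option becomes its own
# (context, continuation) pair that is scored independently.
context = "For a physics exam: What is the speed of light?\nAnswer:"
choices = [" 300,000 km/s", " 150,000 km/s", " 299,792 km/s", " None of the above"]

# One loglikelihood request per choice; the model never "picks" an option,
# it only scores each continuation given the same context.
requests = [(context, choice) for choice in choices]
```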

Each of these inputs is then fed into the model. Could you clarify whether the library simply looks at the first token's output, or whether there is a more complex mechanism, involving the full vocabulary, for determining which choice is the most likely answer?

Each question-plus-answer string is then fed into the model:

For a physics exam: What is the speed of light? A: 300,000 km/s -> model

and the output might look like this?

{'created': 1721875422, 'model': 'llama3.1', 'usage': {'prompt_tokens': 82, 'completion_tokens': 58, 'total_tokens': 140}, 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "That's correct! The speed of light in a vacuum is approximately 300,000 kilometers per second (km/s), which is equivalent to about 186,282 miles per second (mi/s). This is a fundamental constant in physics, denoted by the letter c. Well done!", 'function_call': None}, 'finish_reason': 'stop', 'logprobs': {'tokens': ['That', "'s", ' correct', '!', ' The', ' speed', ' of', ' light', ' in', ' a', ' vacuum', ' is', ' approximately', ' ', '300', ',', '000', ' kilometers', ' per', ' second', ' (', 'km', '/s', '),', ' which', ' is', ' equivalent', ' to', ' about', ' ', '186', ',', '282', ' miles', ' per', ' second', ' (', 'mi', '/s', ').', ' This', ' is', ' a', ' fundamental', ' constant', ' in', ' physics', ',', ' den', 'oted', ' by', ' the', ' letter', ' c', '.', ' Well', ' done', '!', '<|eot_id|>'], 'token_logprobs': [-0.33203125, -0.6328125, -0.7578125, -0.043945312, -0.0022125244, -0.00019073486, -3.2186508e-06, -7.1525574e-06, -0.38867188, -0.22558594, -8.821487e-06, -0.000111579895, -0.034179688, -0.0012207031, -0.5234375, -3.0994415e-05, -1.180172e-05, -0.036865234, -4.4107437e-06, -1.9669533e-05, -0.003967285, -0.004547119, -5.722046e-06, -1.9140625, -1.3203125, -0.011108398, -0.625, -3.2186508e-06, -2.328125, -0.00016212463, -0.0390625, -2.2411346e-05, -0.043945312, -8.8214874e-05, -5.2452087e-06, -0.0003414154, -0.43554688, -0.005218506, -1.0609627e-05, -0.1328125, -1.1796875, -0.58203125, -0.021240234, -0.00064468384, -0.004699707, -0.20214844, -0.0022125244, -0.8359375, -0.21386719, -5.9604645e-07, -0.0040893555, -0.0001745224, -0.1796875, -0.16601562, -0.30273438, -0.89453125, -0.00013256073, -0.035888672, -0.13671875], 'token_ids': [4897, 596, 4495, 0, 578, 4732, 315, 3177, 304, 264, 29302, 374, 13489, 220, 3101, 11, 931, 41668, 824, 2132, 320, 16400, 2754, 705, 902, 374, 13890, 311, 922, 220, 9714, 11, 16544, 8931, 824, 2132, 320, 8318, 2754, 570, 1115, 374, 264, 16188, 6926, 304, 22027, 11, 3453, 9437, 555, 279, 6661, 272, 13, 8489, 2884, 0, 128009]}}]}

Additionally, I am curious about how the library calculates the accuracy (ACC) from the log probabilities obtained for each choice. Specifically, are we considering the probability of the first token, or are we summing the probabilities across tokens? Could you direct me to the specific lines of code in the LM-Evaluation-Harness that handle this calculation, and provide a brief explanation of the method used?

baberabb commented 2 months ago

Hi! I think this discussion should answer your question: https://github.com/EleutherAI/lm-evaluation-harness/issues/942
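
In short, the approach described there works roughly like this (a simplified sketch, not the harness's exact code): each choice gets a summed log-likelihood over its continuation tokens; `acc` takes the argmax of the raw sums, while `acc_norm` divides each sum by the byte length of the choice string before the argmax.

```python
import numpy as np

# Simplified sketch of selecting an answer from per-choice log-likelihoods.
# The log-likelihood values here are made up for illustration.
choices = ["300,000 km/s", "150,000 km/s", "299,792 km/s", "None of the above"]
lls = np.array([-4.1, -9.7, -3.2, -8.5])  # summed logprob of each continuation
gold = 2                                   # index of the correct choice

# acc: argmax over the raw summed log-likelihoods
acc = float(np.argmax(lls) == gold)

# acc_norm: normalize by the byte length of each choice before the argmax
byte_lens = np.array([len(c.encode("utf-8")) for c in choices])
acc_norm = float(np.argmax(lls / byte_lens) == gold)
```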

With respect to #2135, likelihood-based tasks aren't supported with chat-completion endpoints yet; I have made a PR to make that clearer. Also, I don't think it's possible with the API you are using, as it doesn't seem to support an echo-like argument where the logprobs of the prompt (rather than of the model's generation) are returned.
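
For illustration (the response shape and numbers below are invented, not any specific provider's schema): with echo, the returned logprobs cover the prompt tokens, so the score of an answer choice can be recovered by summing over just its token positions.

```python
# Invented example of an echo-style completions response: logprobs are
# returned for the prompt tokens as well as (or instead of) a generation.
response = {
    "choices": [{
        "logprobs": {
            # prompt tokens, ending with the answer choice " 299,792 km/s"
            "tokens": ["For", " a", " physics", " exam", ":", " What", " is",
                       " the", " speed", " of", " light", "?", "\nAnswer", ":",
                       " 299", ",", "792", " km", "/s"],
            "token_logprobs": [None, -2.1, -3.4, -1.0, -0.2, -1.5, -0.4,
                               -0.1, -0.8, -0.05, -0.03, -0.6, -1.2, -0.3,
                               -0.9, -0.1, -0.3, -0.5, -0.2],
        }
    }]
}

# The answer choice " 299,792 km/s" spans the last 5 tokens here, so its
# log-likelihood is the sum of the last 5 token logprobs.
token_logprobs = response["choices"][0]["logprobs"]["token_logprobs"]
choice_loglikelihood = sum(token_logprobs[-5:])
```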

monk1337 commented 2 months ago

@baberabb Thank you for the response and for pointing to the references. I wanted to confirm: if the API includes an echo feature (returning the log probabilities of the prompt as well), will it be compatible with the lm-harness? Is there anything else we need to include in the API output to ensure compatibility with the LM harness?

baberabb commented 2 months ago

Yeah, that's the main one. And not many APIs support echo with chat inputs, tbf; Together is the only one I know of. I don't think even vLLM supports it with their chat-completions endpoint, iirc.

owos commented 1 week ago

@baberabb could you confirm whether @monk1337 is right about the question formatting? If so, does that mean the harness sends each option (phrase) to the model individually, gets the likelihood of each option, and then takes the highest-likelihood option as the predicted answer?