Psycoy / MixEval

The official evaluation suite and dynamic data release for MixEval.
https://mixeval.github.io/

Weird random answer if API endpoint is not available #20

Closed: carstendraschner closed this issue 1 week ago

carstendraschner commented 2 weeks ago

When your API is not specified correctly, e.g., the judge endpoint responds with:

Error code: 404 - {'error': {'code': 'DeploymentNotFound', 'message': 'The API deployment for this resource does not exist. If you created the deployment within the last 5 minutes, please wait a moment and try again.'}}
Error in GPT_decode, retrying...

It is obvious that the benchmark should not complete, since the judge is not available. But in my case I somehow get a final response that looks like this:

{"problem_type": "multiple-choice", "context": "animal", "prompt": "What is the animal trying to accomplish?", "options": ["sand trap", "live long", "leave home", "feel pain", "eating"], "target": [1], "benchmark_name": "CommonsenseQA", "formated_input": "animal\nWhat is the animal trying to accomplish?\nA. sand trap\nB. live long\nC. leave home\nD. feel pain\nE. eating\nAnswer with the option letter from the given choices directly.", "id": "3", "response": "E. eating", 
"judge_response": null, 
"judge_option": "D"}

How can it be that judge_response is null, yet judge_option is "D"? As a result, the benchmark somehow runs through and produces evaluable results despite the fact that no judge was ever used.

If you want to reproduce this quickly:

Regards

carstendraschner commented 2 weeks ago

might come from:

metric_utils.py line 318 https://github.com/Psycoy/MixEval/blob/4a572edf79e247ac202465025d54965ce277b49d/mix_eval/utils/metric_utils.py#L318

But why is this behavior desired?

Psycoy commented 2 weeks ago

hi Carsten,

This random choice is designed for conditions where the judge model fails to produce a parseable option, e.g., it refuses to answer or does not respond in the desired format. In such conditions, we randomly assign an option (each option has an equal chance of being selected).
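
A minimal sketch of that fallback idea (not the actual MixEval code; the function and variable names here are made up for illustration):

import random

def parse_judge_option(judge_response, options):
    # Return the option letter the judge picked; fall back to a uniformly
    # random letter when nothing can be parsed.
    letters = [chr(ord("A") + i) for i in range(len(options))]
    if judge_response:
        text = judge_response.strip().upper()
        for letter in letters:
            # crude matching for illustration; the real parser is more involved
            if text.startswith(letter):
                return letter
    # refusal, off-format answer, or (as in this issue) no judge response at all
    return random.choice(letters)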

Such conditions are rare (generally <0.1% of cases), unless the API model is not correctly configured. When the judge model is broken, it reports a message like the one you provided.

I will try to distinguish these two kinds of conditions to avoid the ambiguity.
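
One possible way to separate the two cases (just a sketch of the idea, reusing the hypothetical parse_judge_option above, not the planned fix) would be to fail loudly when the judge never returned anything, and only fall back to a random option when a response exists but cannot be parsed:

def resolve_judge_option(judge_response, options):
    if judge_response is None:
        # Judge endpoint was unreachable or misconfigured (the case in this issue):
        # abort the evaluation instead of silently assigning a random option.
        raise RuntimeError("Judge model returned no response; check the API configuration.")
    # Judge responded but possibly off-format: keep the existing random fallback.
    return parse_judge_option(judge_response, options)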

carstendraschner commented 1 week ago

Thanks for your response and for your efforts!