Psycoy / MixEval

The official evaluation suite and dynamic data release for MixEval.
https://mixeval.github.io/

Weird random answer if API endpoint is not available #20

Closed: carstendraschner closed this issue 1 week ago

carstendraschner commented 2 weeks ago

When your API is not specified correctly, e.g., the judge endpoint responds with:

Error code: 404 - {'error': {'code': 'DeploymentNotFound', 'message': 'The API deployment for this resource does not exist. If you created the deployment within the last 5 minutes, please wait a moment and try again.'}}
Error in GPT_decode, retrying...

It is obvious that the benchmark should not complete, since the judge is not available. But in my case I somehow get a final response that looks like this:

{"problem_type": "multiple-choice", "context": "animal", "prompt": "What is the animal trying to accomplish?", "options": ["sand trap", "live long", "leave home", "feel pain", "eating"], "target": [1], "benchmark_name": "CommonsenseQA", "formated_input": "animal\nWhat is the animal trying to accomplish?\nA. sand trap\nB. live long\nC. leave home\nD. feel pain\nE. eating\nAnswer with the option letter from the given choices directly.", "id": "3", "response": "E. eating", 
"judge_response": null, 
"judge_option": "D"}

How can it be that judge_response is null, yet judge_option is "D"? As a result, the benchmark somehow runs through and produces evaluable results despite the fact that no judge was ever used.

If you want to reproduce this quickly:

Regards

carstendraschner commented 2 weeks ago

might come from:

metric_utils.py line 318 https://github.com/Psycoy/MixEval/blob/4a572edf79e247ac202465025d54965ce277b49d/mix_eval/utils/metric_utils.py#L318

But why is this behavior desired?

Psycoy commented 2 weeks ago

hi Carsten,

This random choice is designed for conditions where the judge model fails to produce a parseable option, e.g., it refuses to answer or does not respond in the desired format. In such conditions, we randomly assign an option (each option has an equal chance of being selected).
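
A minimal sketch of that fallback idea (not the actual MixEval code; the function and variable names here are made up for illustration):

import random

def parse_judge_option(judge_response, options):
    # Return the option letter the judge picked; fall back to a uniformly
    # random letter when nothing can be parsed.
    letters = [chr(ord("A") + i) for i in range(len(options))]
    if judge_response:
        text = judge_response.strip().upper()
        for letter in letters:
            # crude matching for illustration; the real parser is more involved
            if text.startswith(letter):
                return letter
    # refusal, off-format answer, or (as in this issue) no judge response at all
    return random.choice(letters)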

Such conditions are rare (generally <0.1% of cases), unless the API model is not correctly configured. When the judge model is broken, it reports a message like the one you provided.

I will try to distinguish these two kinds of conditions to avoid the ambiguity.
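
One possible way to separate the two cases (just a sketch of the idea, reusing the hypothetical parse_judge_option above, not the planned fix) would be to fail loudly when the judge never returned anything, and only fall back to a random option when a response exists but cannot be parsed:

def resolve_judge_option(judge_response, options):
    if judge_response is None:
        # Judge endpoint was unreachable or misconfigured (the case in this issue):
        # abort the evaluation instead of silently assigning a random option.
        raise RuntimeError("Judge model returned no response; check the API configuration.")
    # Judge responded but possibly off-format: keep the existing random fallback.
    return parse_judge_option(judge_response, options)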

carstendraschner commented 1 week ago

Thanks for your response and for your efforts!