carstendraschner closed this issue 1 week ago
This might come from metric_utils.py line 318:
https://github.com/Psycoy/MixEval/blob/4a572edf79e247ac202465025d54965ce277b49d/mix_eval/utils/metric_utils.py#L318
But why is this behavior desired?
Hi Carsten,
The random choice is designed for conditions where the judge model fails to produce a parseable option, e.g., it refuses to answer or does not respond in the desired format. In such conditions, we randomly assign an option (each option has an equal chance of being selected).
Such conditions are rare (generally <0.1% of cases), unless the API model is not correctly configured. In the condition that the judge model itself is broken, it will report the message you provided.
I will try to discriminate between these two kinds of conditions to avoid ambiguity.
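For reference, the fallback described above can be sketched roughly as follows. This is a minimal, hypothetical illustration of the two conditions being discussed, not MixEval's actual code: the names `parse_option`, `score_answer`, and `JudgeAPIError` are made up for this sketch, and the real logic lives in `metric_utils.py`.

```python
import random

OPTIONS = ["A", "B", "C", "D"]


class JudgeAPIError(Exception):
    """Raised when the judge endpoint itself is unreachable or misconfigured."""


def parse_option(judge_response):
    """Return the option letter if the judge answered in the expected format,
    None if the reply exists but cannot be parsed (refusal, free text, etc.)."""
    if judge_response is None:
        # No reply at all: the judge API is broken, so fail loudly instead
        # of silently falling back to a random score.
        raise JudgeAPIError("judge returned no response; check API config")
    text = judge_response.strip().upper()
    return text if text in OPTIONS else None


def score_answer(judge_response, rng=random):
    option = parse_option(judge_response)  # raises JudgeAPIError if judge is down
    if option is None:
        # Judge replied, but not in a parseable format: fall back to a
        # uniformly random option, as described above.
        option = rng.choice(OPTIONS)
    return option
```

Separating the "unparseable reply" case (random fallback) from the "no reply" case (hard error) is exactly the discrimination mentioned above, and it would prevent a broken API configuration from producing an evaluatable run.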
Thanks for your response and for your efforts!
When your API is not specified correctly, e.g. the judge endpoint responds:
It is obvious that the benchmark should not complete, as the judge is not available. But in my case I somehow get a final response which looks like this:
How can it be that, despite the judge response being
null
we still get a judged result?
This leads to the fact that the benchmark somehow runs through and is evaluatable despite no judge ever being used. If you want to reproduce this quickly:
regards