OpenLMLab / LEval

[ACL'24 Oral] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark
GNU General Public License v3.0

Question on llm_eval #6

Closed · dudesummer closed this 8 months ago

dudesummer commented 10 months ago

Hi, I have a question about the criteria used when an LLM evaluates answers. In the code below, the winner is determined by simple substring matching. Intuitively, the error of this simple strategy seems large. Have you studied this problem, or is there a more accurate way to judge?

```python
if judge["output_format"] == "[[A]]":
    if "[[A]]" in judgment:
        winner = "A"
    elif "[[B]]" in judgment:
        winner = "B"
    elif "[[C]]" in judgment:
        winner = "tie"
    else:
        winner = "error"
else:
    raise ValueError(f"invalid output format: {judge['output_format']}")
```
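For what it's worth, a slightly more robust variant of this parsing is to anchor the match with a regex and take the last `[[A]]`/`[[B]]`/`[[C]]` marker, so a judgment that merely quotes the options earlier in its reasoning is not misread. This is only a sketch of an alternative, not the parsing L-Eval actually uses; the `judgment` string and the verdict labels are taken from the snippet above.

```python
import re

# Verdict labels as used in the snippet above.
VERDICT_PATTERN = re.compile(r"\[\[(A|B|C)\]\]")

def parse_verdict(judgment: str) -> str:
    """Return "A", "B", "tie", or "error" from a judge response.

    Uses the *last* [[A]]/[[B]]/[[C]] marker so that options quoted
    earlier in the judge's reasoning do not decide the outcome.
    """
    matches = VERDICT_PATTERN.findall(judgment)
    if not matches:
        return "error"
    return {"A": "A", "B": "B", "C": "tie"}[matches[-1]]
```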

ChenxinAn-fdu commented 10 months ago

Hi, thank you for the question! For LLM evaluation we generally follow the setting of short-context open-ended task benchmarks (paper | repo), and the prompt design follows theirs as well. In my experiments, GPT-4 can follow this strategy well (0 `error` / 192 rounds), while Turbo sometimes generates `error`, but not too often (30 / 362 rounds).
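One simple way to handle the occasional `error` rounds mentioned above is to re-query the judge a couple of times before counting the round as an error. This is a minimal sketch under stated assumptions: `query_judge` is a hypothetical helper that sends the judge prompt to the evaluator model and returns its raw text response, and `parse_verdict` is the parser sketched earlier in this thread.

```python
def judge_with_retries(prompt: str, max_retries: int = 2) -> str:
    """Query the judge and retry when no verdict marker is found."""
    for _ in range(max_retries + 1):
        judgment = query_judge(prompt)    # hypothetical API call
        winner = parse_verdict(judgment)  # parser sketched above
        if winner != "error":
            return winner
    return "error"  # give up after the retries are exhausted
```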

ChenxinAn-fdu commented 8 months ago

Please feel free to reopen this issue if you encounter another problem~