Closed dudesummer closed 8 months ago
Hi, thank you for the question! For LLM evaluations, we generally follow the setting in short-context open-ended task benchmarks (paper | repo), as is the prompt design. In my experiments, GPT-4 can follow this strategy well (0 / 192 rounds produce "error"), while Turbo sometimes generates "error", but not too often (30 / 362 rounds).
Please feel free to reopen if you encounter another problem~
Hi, I have a question about the criteria used when evaluating answers with an LLM. As the code below shows, the winner is determined by simple substring matching. Intuitively, the error of this simple strategy seems large. Have you studied this problem, or is there a more accurate method of judgment?
```python
if judge["output_format"] == "[[A]]":
    if "[[A]]" in judgment:
        winner = "A"
    elif "[[B]]" in judgment:
        winner = "B"
    elif "[[C]]" in judgment:
        winner = "tie"
    else:
        winner = "error"
else:
    raise ValueError(f"invalid output format: {judge['output_format']}")
```
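One source of error in the substring approach is that a judge's chain of reasoning may mention several verdict tags (e.g. "[[A]] is concise, but [[B]] is more accurate") before stating its final answer, and the `if/elif` order then silently picks "A". A hypothetical, slightly more robust sketch (not the repo's actual code) is to extract all `[[X]]` tags with a regex and take the last one as the final verdict:

```python
import re

def parse_verdict(judgment: str) -> str:
    """Return 'A', 'B', 'tie', or 'error' from a judge's free-form output.

    Takes the LAST [[X]] tag found, on the assumption that the judge's
    final verdict comes after any tags mentioned while reasoning.
    """
    matches = re.findall(r"\[\[([ABC])\]\]", judgment)
    if not matches:
        return "error"
    return {"A": "A", "B": "B", "C": "tie"}[matches[-1]]

print(parse_verdict("[[A]] is concise, but overall [[B]] wins."))  # B
print(parse_verdict("No clear tag here."))                         # error
```

Whether "last tag wins" is the right tie-breaking rule depends on the prompt; if the prompt instructs the judge to end with its verdict, this assumption is reasonable.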