jerilkuriakose opened this issue 8 months ago
Hi Jeril. Thanks for reaching out. The official FastChat LLM Judge repo actually does not use the reference answers contained in the fastchat/llm_judge/data/mt_bench/question.jsonl file that you compared against. The official eval entrypoint, `python gen_judgment.py`, uses the reference answers from fastchat/llm_judge/data/mt_bench/reference_answer/gpt-4.jsonl, which contains many more errors because they are unfiltered completions from GPT-4. In our fixed solutions, we use many of the reference answers from the FastChat jsonl that you linked above. Ultimately, when comparing our fixed solutions against the GPT-4 generations, the diff covers nearly 25% of the questions that use reference solutions (i.e. math, coding, reasoning).
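For anyone who wants to see this concretely, below is a rough sketch (not the official FastChat tooling) of how the two reference sources mentioned above could be compared per question. The file paths come from this thread, and the field names (`category`, `reference`, `choices`, `turns`) reflect my understanding of the FastChat jsonl schemas, so they may need adjusting.

```python
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

questions = load_jsonl("fastchat/llm_judge/data/mt_bench/question.jsonl")
gpt4_refs = load_jsonl("fastchat/llm_judge/data/mt_bench/reference_answer/gpt-4.jsonl")
gpt4_by_id = {r["question_id"]: r for r in gpt4_refs}

# Only these categories are scored against reference answers in MT-Bench.
REF_CATEGORIES = {"math", "coding", "reasoning"}

differing = 0
total = 0
for q in questions:
    if q["category"] not in REF_CATEGORIES:
        continue
    total += 1
    # Reference answers embedded in question.jsonl (may be absent for some items).
    ref_in_question_file = q.get("reference", [])
    # Unfiltered GPT-4 completions that gen_judgment.py actually uses.
    gpt4_entry = gpt4_by_id.get(q["question_id"])
    gpt4_turns = gpt4_entry["choices"][0]["turns"] if gpt4_entry else []
    if ref_in_question_file != gpt4_turns:
        differing += 1

print(f"{differing}/{total} reference-based questions differ between the two files")
```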
Thank you for the update.
Could you please recommend how MT-Bench should be used? Even if we use the corrected version of MT-Bench to generate the GPT-4 reference answers, it's still going to have around 25% flawed solutions, right? Could you also help me understand how the reference answers were generated for the MT-Bench Corrected set described in the article Inflection-2.5: meet the world's best personal AI?
Hi, I was reading the article Inflection-2.5: meet the world's best personal AI, and it mentions that "nearly 25% of examples in the reasoning, math, and coding categories had incorrect reference solutions or questions with flawed premises". I compared the FastChat MT-Bench questions against the Inflection MT-Bench corrected questions and found only 4 questions with any difference. I downloaded the FastChat MT-Bench questions from the FastChat LLM Judge repo and compared them with the corrected version of MT-Bench using mergely. The comparison shows only 4 changes, and those 4 changes look correct in terms of the references provided. Can you please help me understand how nearly 25% were flawed?
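For reference, the comparison described above could also be reproduced programmatically along these lines. This is only a sketch: the FastChat path is the one quoted in this thread, both files are assumed to follow the FastChat question.jsonl schema, and the filename for Inflection's corrected questions is a placeholder for wherever a local copy lives.

```python
import json

def load_questions(path):
    """Load an MT-Bench question.jsonl file keyed by question_id."""
    questions = {}
    with open(path) as f:
        for line in f:
            if line.strip():
                q = json.loads(line)
                questions[q["question_id"]] = q
    return questions

# FastChat path is from this thread; the corrected-questions filename is a placeholder.
original = load_questions("fastchat/llm_judge/data/mt_bench/question.jsonl")
corrected = load_questions("mt_bench_corrected_question.jsonl")  # placeholder path

changed = []
for qid, q in original.items():
    c = corrected.get(qid)
    if c is None:
        continue
    # Compare both the question turns and any embedded reference answers.
    if q.get("turns") != c.get("turns") or q.get("reference") != c.get("reference"):
        changed.append(qid)

print(f"{len(changed)} question(s) differ: {sorted(changed)}")
```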