InflectionAI / Inflection-Benchmarks

Public Inflection Benchmarks
MIT License

Discrepancy in the reported percentage of flawed questions in FastChat MT-Bench #2

Open · jerilkuriakose opened this issue 8 months ago

jerilkuriakose commented 8 months ago

Hi, I was reading the article Inflection-2.5: meet the world's best personal AI, which mentions that nearly 25% of examples in the reasoning, math, and coding categories had incorrect reference solutions or questions with flawed premises. I compared the FastChat MT-Bench questions with the corrected Inflection MT-Bench questions and found only 4 questions with a change / difference.

I downloaded the FastChat MT-Bench questions using the following command from FastChat LLM Judge:

```
python3 download_mt_bench_pregenerated.py
```

And compared it with the corrected version of MT-Bench using mergely. The comparison shows only 4 changes, and those 4 changes look correct in terms of the references provided. Can you please help me understand how nearly 25% were flawed?
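For reproducibility, the comparison I ran amounts to something like the sketch below (I used mergely on the raw files; both file paths here are assumptions based on the two repos' layouts, so adjust them to your checkout):

```python
import json

def load_jsonl(path):
    """Load a JSONL file as a dict keyed by question_id."""
    with open(path) as f:
        return {row["question_id"]: row for row in map(json.loads, f)}

# Both paths are assumptions: the first matches the FastChat repo layout,
# the second stands in for wherever the corrected questions are checked out.
fastchat_qs = load_jsonl("fastchat/llm_judge/data/mt_bench/question.jsonl")
corrected_qs = load_jsonl("inflection_benchmarks/mt_bench/question.jsonl")

# Report the question_ids whose records differ between the two files.
changed = sorted(
    qid for qid in fastchat_qs.keys() & corrected_qs.keys()
    if fastchat_qs[qid] != corrected_qs[qid]
)
print(f"{len(changed)} questions differ: {changed}")
```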

bobqywei commented 8 months ago

Hi Jeril. Thanks for reaching out. The official FastChat LLM Judge repo actually does not use the reference answers contained in the fastchat/llm_judge/data/mt_bench/question.jsonl file that you compared against. The official eval entrypoint, python gen_judgment.py, uses the reference answers from fastchat/llm_judge/data/mt_bench/reference_answer/gpt-4.jsonl, which contains many more errors, as they are unfiltered completions from GPT-4.

In our fixed solutions, we use many of the reference answers from the FastChat jsonl you linked above. Ultimately, when comparing our fixed solutions against the GPT-4 generations, there is a larger diff: nearly 25% of the questions that use reference solutions (i.e., math, coding, reasoning).
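A minimal sketch of that comparison, against gpt-4.jsonl rather than question.jsonl (the gpt-4.jsonl path matches the FastChat repo; the path for the fixed reference answers is a placeholder, since the exact layout may differ):

```python
import json

def load_refs(path):
    """Map question_id -> reference answer record from a JSONL file."""
    with open(path) as f:
        return {row["question_id"]: row for row in map(json.loads, f)}

# The first path matches the FastChat repo; the second is a hypothetical
# location for the fixed reference answers.
gpt4_refs = load_refs("fastchat/llm_judge/data/mt_bench/reference_answer/gpt-4.jsonl")
fixed_refs = load_refs("corrected/reference_answer/gpt-4.jsonl")  # hypothetical path

# Compare only the questions that have reference answers in both files
# (these are the math, coding, and reasoning categories).
common = sorted(gpt4_refs.keys() & fixed_refs.keys())
diff = [qid for qid in common if gpt4_refs[qid] != fixed_refs[qid]]
print(f"{len(diff)}/{len(common)} reference answers differ "
      f"(~{100 * len(diff) / len(common):.0f}%)")
```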

jerilkuriakose commented 8 months ago

Thank you for the update.

jerilkuriakose commented 8 months ago

Could you please recommend how to use MT-Bench? Even if we use the corrected version of MT-Bench to generate the GPT-4 reference answers, it's still going to have around 25% flawed solutions, right? Could you also please help me understand how the reference answers for the corrected MT-Bench were generated in the article Inflection-2.5: meet the world's best personal AI?