Duplicates in benchmark data

carstendraschner commented 2 months ago

Hi MixEval Team & @Psycoy ,

thanks for your repo and your work to improve open source LLM benchmarks!

Issue: While testing, I noticed that the response files for MixEval-Hard free-form (including those from my own model) contain duplicate tasks.

Possible Problem: Is this intentional, or is it the unintended result of sampling with replacement?

Assumption: My guess is that you want a certain overlap with the LMSYS arena queries while keeping as many tasks unique as possible, right?

Sample from Repo: https://github.com/Psycoy/MixEval/blob/5fe21c8748337d23ad3e8ed559e4b90b9a1b9c7e/mix_eval/data/model_responses/gemma_11_7b_instruct/mixeval_hard/2024-06-01/gemma_11_7b_instruct_close_freeform_hard.jsonl#L15 (lines 15 to 18)
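
For anyone who wants to check their own response files, here is a minimal sketch that groups JSONL lines by task text and reports which line numbers repeat. The field name `prompt` is an assumption made for illustration and may not match the actual schema of the MixEval response files; the sample records are invented.

```python
import json
from collections import defaultdict


def find_duplicate_tasks(jsonl_lines, key="prompt"):
    """Return {task text: [1-based line numbers]} for tasks seen more than once.

    `key` is a hypothetical field name; adjust it to the real schema.
    """
    seen = defaultdict(list)
    for line_no, raw in enumerate(jsonl_lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        if key not in record:
            continue  # skip records without the assumed field
        seen[str(record[key])].append(line_no)
    return {task: nums for task, nums in seen.items() if len(nums) > 1}


# Invented sample data, not taken from the repository.
sample = [
    '{"prompt": "Who wrote Hamlet?", "response": "Shakespeare"}',
    '{"prompt": "What is the capital of France?", "response": "Paris"}',
    '{"prompt": "Who wrote Hamlet?", "response": "Shakespeare"}',
]
duplicates = find_duplicate_tasks(sample)
print(duplicates)
```

To run it on a real file, pass an open file handle instead of the list, for example `find_duplicate_tasks(open(path, encoding="utf-8"))`.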

How do you see this?

Kind regards Carsten

Psycoy commented 2 months ago

Hi, Carsten,

This is expected: we build a mapping from real-world user queries to existing benchmark queries, so several user queries can map to the same benchmark query, which results in duplicates. Such duplication is inevitable if we want the benchmark query distribution to be as close as possible to the real-world query distribution, and it is less harmful than it seems: all we want is the final aggregate accuracy score, and MixEval provides a good way to aggregate.
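
One way to read the point about aggregation: averaging over a list that contains duplicates is the same as a weighted average over the unique items, where each weight is the number of user queries mapped to that item. The sketch below illustrates this with invented item ids and correctness labels; it is not MixEval's actual scoring code.

```python
from collections import Counter


def accuracy_with_duplicates(items, correct):
    """Plain mean over the item list, duplicates included."""
    return sum(correct[i] for i in items) / len(items)


def weighted_accuracy(items, correct):
    """Mean over unique items, each weighted by how often it occurs."""
    counts = Counter(items)
    total = sum(counts.values())
    return sum(n * correct[i] for i, n in counts.items()) / total


def unique_accuracy(items, correct):
    """Mean over unique items only, ignoring how often each occurs."""
    unique = set(items)
    return sum(correct[i] for i in unique) / len(unique)


# Invented example: "q1" was matched by three user queries, "q2" by one.
items = ["q1", "q1", "q1", "q2"]
correct = {"q1": 1, "q2": 0}  # 1 = model answered correctly

print(accuracy_with_duplicates(items, correct))  # 0.75
print(weighted_accuracy(items, correct))         # 0.75
print(unique_accuracy(items, correct))           # 0.5
```

Under this reading, removing duplicates changes the score only by discarding the weights that reflect how often a kind of query occurs among the user queries.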

carstendraschner commented 2 months ago

Hi, thank you for your feedback and the explanation. What do you think of taking the second-best query (and so on) if the best one is already covered? Do you think this would significantly harm the LMSYS overlap?
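
A minimal sketch of the fallback idea, assuming a hypothetical similarity matrix where `similarity[u][b]` scores user query `u` against benchmark query `b`. How MixEval actually retrieves matches is not described in this thread, so the scores and the greedy strategy here are illustrative only.

```python
def assign_best_only(similarity):
    """Each user query takes its top-scoring benchmark query (duplicates allowed)."""
    return [max(range(len(scores)), key=scores.__getitem__) for scores in similarity]


def assign_without_duplicates(similarity):
    """Greedy fallback: take the best benchmark query that is not already used.

    Returns None for a user query when every benchmark query is taken.
    """
    used = set()
    assignment = []
    for scores in similarity:
        # Benchmark indices ordered from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
        choice = next((b for b in ranked if b not in used), None)
        if choice is not None:
            used.add(choice)
        assignment.append(choice)
    return assignment


# Invented scores: all three user queries prefer benchmark query 0.
similarity = [
    [0.90, 0.80, 0.10],
    [0.95, 0.20, 0.30],
    [0.70, 0.60, 0.50],
]
print(assign_best_only(similarity))           # [0, 0, 0]
print(assign_without_duplicates(similarity))  # [0, 2, 1]
```

Note that this greedy variant depends on the order in which user queries are processed: the second query has the highest score for benchmark query 0 but still falls back to its second choice because the first query claimed it earlier.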

regards

Psycoy commented 2 months ago

Thanks for the nice comments! I think this is also an option, but I suspect it would only make the benchmark look better, with minimal impact on the model rankings. I will run some experiments under this setting.

Duplicates in benchmark data #10