clp-research / clembench

A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark

[leaderboard] rethink approach to deciding on which models to test #83

Open davidschlangen opened 2 months ago

davidschlangen commented 2 months ago

We need to get more clarity on what we want to achieve with the leaderboard, and then think about how we can achieve it.

A) Do we want to mirror other leaderboards? Rule could be: At any time, we strive to have numbers for the 30 best-performing models on ChatArena.

B) Do we want to identify the Pareto frontier of size/performance, independent of a model's ranking elsewhere? That would be great, but it makes the search space too large; we need to limit the number of models we test.

C) Are there certain models for which we want to know numbers, regardless of their performance elsewhere? I guess so. Rule could be: Test "big-name" models (e.g., Llama-3) as soon as they become available, so that we can set expectations (e.g., for how good derivatives might be).

My guess would be that a combination of A) and C) would be best. This limits testing to roughly 30 models, with some fluctuation. It could even be automated: once a model that we have not yet tested appears on ChatArena, we run the benchmark.

(We want to automatically parse their list anyway, to check for rank correlations.)
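To make the idea concrete, the automation could look roughly like the sketch below. Everything in it is an assumption for illustration: the file names, column names, and the idea of working from CSV exports of the arena leaderboard and of our own results are placeholders, not an existing clembench interface.

```python
"""Sketch of the proposed automation (file names and columns are placeholders)."""

import pandas as pd
from scipy.stats import spearmanr

TOP_N = 30  # proposal A): track the N best-performing models on the arena

# Assumed inputs: a CSV export of the external arena leaderboard
# (columns: model, arena_rank) and a CSV of our own benchmark results
# (columns: model, clemscore).
arena = pd.read_csv("arena_leaderboard.csv")
ours = pd.read_csv("clembench_results.csv")

# 1) Which of the current top-N arena models have we not benchmarked yet?
top_n = arena.sort_values("arena_rank").head(TOP_N)
untested = sorted(set(top_n["model"]) - set(ours["model"]))
print("Models to queue for benchmarking:", untested)

# 2) Rank correlation on the overlap, as a sanity check of our instrument.
merged = top_n.merge(ours, on="model")
if len(merged) >= 2:
    # Higher clemscore = better, so rank it descending to compare with arena_rank.
    rho, p = spearmanr(merged["arena_rank"], merged["clemscore"].rank(ascending=False))
    print(f"Spearman rho on {len(merged)} shared models: {rho:.2f} (p={p:.3f})")
```

This could run on a schedule; the "untested" list then becomes the queue of benchmark runs to trigger, and the correlation gives us the rank-comparison number we want to track anyway.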

davidschlangen commented 2 months ago

This didn't actually address the question of what we want to achieve. My proposal for that would be something like:

To provide an up-to-date overview of the performance of the most prominent LLMs (closed and open) as conversational agents (to the extent that this is measured by our instrument), in order to help ourselves and others decide what to use for related purposes.