clp-research / clembench

A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark

[scoring] ensure that all scores use the full range 0:100 #58

Open davidschlangen opened 4 months ago

davidschlangen commented 4 months ago

At the moment, we are using the aggregated accuracy of the reference game as its experiment score. But this has a chance baseline of 33% (one in three, a random guess). Ideally, we'd use something like max(0, 100 * (accuracy - per_chance) / (100 - per_chance)) to re-scale the score to the full 0:100 range.
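
For concreteness, a minimal sketch of that rescaling in Python (the function name and the handling of the chance level as a percentage are assumptions for illustration, not part of the framework):

```python
def chance_adjusted_score(accuracy: float, chance: float = 100 / 3) -> float:
    """Rescale an accuracy in [0, 100] so that chance-level performance
    maps to 0 and perfect performance maps to 100. Scores below the
    chance baseline are clipped to 0."""
    return max(0.0, 100 * (accuracy - chance) / (100 - chance))

# For the reference game with its 1-in-3 baseline:
# chance_adjusted_score(100.0) -> 100.0   (perfect)
# chance_adjusted_score(33.3)  -> 0.0     (random guessing, clipped)
# chance_adjusted_score(66.7)  -> ~50.0   (halfway between chance and perfect)
```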

But the problem is that, the way things are set up at the moment, the scorer only looks at individual episodes (which here yield a binary score), and bencheval aggregates -- but in a game-agnostic way.

So it looks like we are left with two not-great alternatives: leave things as they are (and thereby artificially lift the scores of all models, since every model gets at least 33 on this game, unless it manages to anti-correlate with the truth...), or make bencheval work game-specifically (sketched below).
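
To illustrate what the second alternative could look like, here is one possible shape for a game-aware normalization step. All names here are hypothetical; this is not the actual bencheval interface, only a sketch of the idea:

```python
from typing import Callable, Dict

# Hypothetical: per-game normalizers keyed by game name. A game-agnostic
# aggregator could look up the game here after averaging episode scores,
# falling back to the identity for games whose scores need no rescaling.
GAME_NORMALIZERS: Dict[str, Callable[[float], float]] = {
    # Reference game: 1-in-3 random-guess baseline.
    "referencegame": lambda acc: max(0.0, 100 * (acc - 100 / 3) / (100 - 100 / 3)),
}

def normalize(game: str, aggregated_score: float) -> float:
    """Apply a game-specific rescaling if one is registered,
    otherwise pass the aggregate through unchanged."""
    return GAME_NORMALIZERS.get(game, lambda s: s)(aggregated_score)
```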

Need to come back and think about this more.

davidschlangen commented 2 months ago

Of course, hidden behind the max(0, ..) is the fact that some weird models might anti-correlate with the truth (which apparently is indeed happening) and reach worse-than-chance performance. That should give them a negative score, which the max hides.
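
To make that concrete with the 1-in-3 baseline (same formula as above, with the clipping removed):

```python
chance = 100 / 3
accuracy = 20.0  # an anticorrelated model, worse than random guessing
raw = 100 * (accuracy - chance) / (100 - chance)
print(raw)       # -> -20.0; max(0, raw) would flatten this to 0,
                 # indistinguishable from merely-at-chance performance
```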