clp-research / clembench

A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark
MIT License

[wordle] reconsider what is used as quality score #90

Closed davidschlangen closed 2 weeks ago

davidschlangen commented 4 months ago

Might be better to use binary success as quality score (which enters into clemscore).

At the moment, it's speed: 100 / (number of the turn at which the game was solved). But that means a score of 100 is pure luck, as there are no positional clues at the first attempt.
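To make the speed formula concrete, here is a minimal sketch. The function name `speed_score` and the choice to return 0 for an unsolved game are assumptions for illustration, not taken from the clembench code.

```python
from typing import Optional


def speed_score(solved_turn: Optional[int]) -> float:
    """Hypothetical helper: 100 divided by the turn at which the game was solved.

    `solved_turn` is 1-based; None means the game was not solved
    (scored as 0 here, which is an assumption).
    """
    if solved_turn is None:
        return 0.0
    if solved_turn < 1:
        raise ValueError("solved_turn must be >= 1")
    return 100 / solved_turn


# Only a first-turn solve reaches 100; a second-turn solve already drops to 50.
for turn in (1, 2, 3, 6):
    print(turn, speed_score(turn))
```

Under this formula the top score is reachable only by guessing the word on the first turn, which is the luck problem described above.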

It might be better if "62" means "62% of all games that were successfully played (i.e., not aborted) were solved in 6 or fewer turns".
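A rough sketch of that binary-success reading, assuming each episode is a dict with the hypothetical keys `aborted` and `solved_turn` (these names are illustrative, not the benchmark's actual record format):

```python
from typing import Optional


def success_rate(episodes: list, max_turns: int = 6) -> Optional[float]:
    """Percentage of non-aborted games solved within `max_turns` turns.

    Returns None when every game was aborted, since the rate is undefined then.
    """
    played = [e for e in episodes if not e["aborted"]]
    if not played:
        return None
    solved = sum(
        1
        for e in played
        if e["solved_turn"] is not None and e["solved_turn"] <= max_turns
    )
    return 100 * solved / len(played)


episodes = [
    {"aborted": False, "solved_turn": 3},
    {"aborted": False, "solved_turn": None},  # played to the end but lost
    {"aborted": True, "solved_turn": None},   # excluded from the denominator
    {"aborted": False, "solved_turn": 6},
]
print(success_rate(episodes))
```

Aborted games are left out of the denominator, so the number reflects only games that were actually played to completion.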

davidschlangen commented 2 weeks ago

This has been addressed in 1.5 (or 1.6). The new rule is: "100 for 3 guesses or fewer, 50 for 4, 30 for 5, 20 for 6, 0 for fail."
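The quoted rule can be written as a small lookup. This is a sketch only: the function name `tiered_score` and the use of `None` for a failed game are assumptions, not the actual clembench implementation.

```python
from typing import Optional

# Scores for 4, 5 and 6 guesses, as quoted in the rule above.
_LATE_GUESS_SCORES = {4: 50, 5: 30, 6: 20}


def tiered_score(guesses_used: Optional[int]) -> int:
    """Hypothetical helper mapping the number of guesses to a quality score.

    `guesses_used` is None when the game was not solved (scored as 0).
    """
    if guesses_used is None:
        return 0
    if guesses_used < 1:
        raise ValueError("guesses_used must be >= 1")
    if guesses_used <= 3:
        return 100
    # Anything beyond 6 guesses is treated as a fail here (assumption).
    return _LATE_GUESS_SCORES.get(guesses_used, 0)


for n in (1, 3, 4, 5, 6, None):
    print(n, tiered_score(n))
```

With this scheme, a lucky first guess no longer scores higher than a solve on the third guess.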