clp-research / clembench

A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark

[games] make “abort” consistent #95

Open davidschlangen opened 2 weeks ago

davidschlangen commented 2 weeks ago

We should reserve “abort” for episodes in which the model has repeatedly failed to follow the formatting instructions, i.e. has produced unparseable output. At least MapWorld seems to interpret it differently: it also counts “has reached the maximum number of turns” among the conditions that trigger an abort.

Why should we do that? Because then % played is fully interpretable as “follows formatting instructions” (and nothing else).
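
To make the proposal concrete, here is a minimal sketch of the distinction. It is not clembench code: `Outcome`, `Episode`, `classify_episode`, `percent_played`, and the threshold of two parse failures standing in for “repeatedly” are all hypothetical, chosen only for illustration. The point is that hitting the turn limit counts as a played-but-lost episode, so only formatting failures lower % played.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    SUCCESS = "success"
    LOSE = "lose"
    ABORTED = "aborted"


@dataclass
class Episode:
    """Hypothetical summary of one finished episode."""
    parse_failures: int      # number of unparseable / badly formatted responses
    reached_goal: bool       # whether the game goal was achieved
    hit_turn_limit: bool     # whether the episode ended at the max number of turns


# Assumption: "repeatedly" means at least this many formatting failures.
MAX_PARSE_FAILURES = 2


def classify_episode(ep: Episode) -> Outcome:
    # Abort is reserved for repeated formatting violations and nothing else.
    if ep.parse_failures >= MAX_PARSE_FAILURES:
        return Outcome.ABORTED
    if ep.reached_goal:
        return Outcome.SUCCESS
    # Any other ending, including ep.hit_turn_limit, is a played but lost episode.
    return Outcome.LOSE


def percent_played(outcomes: Iterable[Outcome]) -> float:
    """Share of episodes that were not aborted, in percent."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    played = sum(1 for o in outcomes if o is not Outcome.ABORTED)
    return 100.0 * played / len(outcomes)
```

Under this sketch, an episode with no parse failures that runs out of turns is classified as `Outcome.LOSE` and still counts toward % played.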