Benchmark Runs
The leaderboard of all runs is available here: Clem Leaderboard
Versions
v0.9 - June 2023
v1.0 - November 2023
v1.5 - March 2024
v1.6 - May 2024
Supported Models
The list of supported open and closed/commercial models can be found here: model registry
Game-play files
Each model has a separate folder for each game result. The outputs are organised as follows: /model/game/experiment. Each episode under a given experiment includes the following files:
- instance.json: information about the episode, including the prompt text
- interactions.json: interactions between the players and the game master
- requests.json: inputs given to the tested model and the outputs it generated
- scores.json: scores at the episode and turn level
- transcript.html: transcript of the dialogue in HTML
- transcript.tex: transcript of the dialogue in LaTeX
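The folder layout above can be walked programmatically. The following is a minimal sketch, not part of the benchmark code: it assumes each episode is a subdirectory of its experiment folder (the episode folder naming is not specified here), and the helper names `iter_episodes` and `load_episode` are hypothetical. The JSON files are loaded as-is, since their internal structure is not described in this document.

```python
import json
from pathlib import Path

# JSON files listed above; the transcripts (HTML/LaTeX) are skipped here.
EPISODE_JSON_FILES = (
    "instance.json",
    "interactions.json",
    "requests.json",
    "scores.json",
)


def iter_episodes(root):
    """Yield (model, game, experiment, episode_dir) for each episode under root.

    Assumed layout: root/model/game/experiment/<episode>/scores.json
    """
    root = Path(root)
    for scores_path in sorted(root.glob("*/*/*/*/scores.json")):
        episode_dir = scores_path.parent
        model, game, experiment = episode_dir.relative_to(root).parts[:3]
        yield model, game, experiment, episode_dir


def load_episode(episode_dir):
    """Return a dict mapping file name to parsed JSON for the files present."""
    episode_dir = Path(episode_dir)
    data = {}
    for name in EPISODE_JSON_FILES:
        path = episode_dir / name
        if path.is_file():
            with path.open(encoding="utf-8") as f:
                data[name] = json.load(f)
    return data
```

Calling `iter_episodes` on the top-level results folder and passing each `episode_dir` to `load_episode` gives one dictionary per episode, keyed by file name.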
Results files
Each run of the benchmark generates CSV and HTML files for all tested models across all games (results.csv and results.html).
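To inspect results.csv without extra dependencies, a sketch like the one below may be enough. The column names of the file are not documented here, so the code makes no assumption about them and simply returns each row as a dictionary keyed by the header line; `load_results` is a hypothetical helper name.

```python
import csv


def load_results(path):
    """Read a results CSV and return its rows as a list of dicts.

    Keys come from the header row of the file; no column names are assumed.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

All values are returned as strings, so any numeric columns would need converting before comparison or aggregation.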