clp-research / clembench

A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark
MIT License

[evaluation] check bencheval score if certain episodes don't have scores file #80

Closed sherzod-hakimov closed 3 months ago

davidschlangen commented 5 months ago

Adding what I know here, for @sherzod-hakimov to fill in later: Apparently, the computed scores are not reliable if some episodes have no transcripts to score (which can happen if the API rejects the queries). In such cases, the computed score is unrealistically high. Instead, the score should be NaN, to alert us to the fact that transcripts are missing.

phisad commented 5 months ago

Just a side note: The transcribe command gives a warning in the console and in clembench.log if an interaction.json is missing for an episode.

davidschlangen commented 5 months ago

Maybe just make it exit in that case? This would make it clear(er) that something is amiss?
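A minimal sketch of what "exit in that case" could look like, not the actual clembench code. It assumes a hypothetical results layout in which every directory named `episode_*` should contain the required file; the function names `find_incomplete_episodes` and `check_or_exit` are made up for illustration.

```python
import sys
from pathlib import Path
from typing import List


def find_incomplete_episodes(results_dir: Path, required_file: str) -> List[Path]:
    """Return all episode directories (assumed to be named 'episode_*')
    that do not contain the required file."""
    return sorted(
        d
        for d in results_dir.rglob("episode_*")
        if d.is_dir() and not (d / required_file).is_file()
    )


def check_or_exit(results_dir: Path, required_file: str) -> None:
    """Report every incomplete episode on stderr and exit with status 1
    instead of only logging a warning."""
    missing = find_incomplete_episodes(results_dir, required_file)
    if missing:
        for d in missing:
            print(f"missing {required_file} in {d}", file=sys.stderr)
        sys.exit(1)
```

Calling `check_or_exit(Path("results"), "interaction.json")` before any further processing would stop the run as soon as a file is absent; the `results` path is likewise only a placeholder.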

sherzod-hakimov commented 5 months ago

Update:

If certain 'scores.json' files are missing, the average is computed only over the scores files that exist. Episodes without a scores file are simply ignored, so they do not count toward the average at all.
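The effect can be sketched as follows. This is an illustration, not the actual evaluation script: it assumes each episode's score is available as a float, or `None` when its scores file is missing, and the names `lenient_mean` and `strict_mean` are hypothetical.

```python
import math
from typing import Optional, Sequence


def lenient_mean(scores: Sequence[Optional[float]]) -> float:
    """Average only over episodes that have a score (the behaviour described above)."""
    present = [s for s in scores if s is not None]
    if not present:
        return math.nan
    return sum(present) / len(present)


def strict_mean(scores: Sequence[Optional[float]]) -> float:
    """Return NaN as soon as any episode is missing its score."""
    if not scores or any(s is None for s in scores):
        return math.nan
    return sum(scores) / len(scores)


# Made-up example: three scored episodes, two whose scores file is missing.
episode_scores = [100.0, 100.0, 100.0, None, None]
print(lenient_mean(episode_scores))  # 100.0
print(strict_mean(episode_scores))   # nan
```

With the lenient variant, the two missing episodes leave no trace in the result; the strict variant returns NaN, which makes the gap visible in any downstream table.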

davidschlangen commented 5 months ago

Which is bad, right? This is a description of the (source of the) problem.

phisad commented 5 months ago

I added a test workflow that runs Sherzod's script, and a result badge to the README, for https://github.com/clembench/clembench-runs