Closed sherzod-hakimov closed 5 months ago
Just a side note: the transcribe command only logs a warning (to the console and to clembench.log) if an interaction.json is missing for an episode.
Maybe just make it exit in that case? That would make it clearer that something is amiss.
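A rough sketch of what "exit instead of warn" could look like, as a standalone pre-check rather than a change to the transcribe command itself. The directory layout (episode folders named `episode_*` somewhere below a results root) and both function names are assumptions for illustration, not clembench's actual code:

```python
import sys
from pathlib import Path


def find_missing_interactions(results_dir):
    """Return episode directories that have no interaction.json.

    Assumes (hypothetically) that episode folders are named 'episode_*'.
    """
    root = Path(results_dir)
    return sorted(
        d for d in root.rglob("episode_*")
        if d.is_dir() and not (d / "interaction.json").is_file()
    )


def check_or_exit(results_dir):
    """Exit with a non-zero status if any episode lacks interaction.json."""
    missing = find_missing_interactions(results_dir)
    if missing:
        for d in missing:
            print(f"missing interaction.json: {d}", file=sys.stderr)
        sys.exit(1)
```

Calling `check_or_exit("results")` before transcribing would make a run with missing episodes fail loudly instead of leaving a warning that is easy to overlook in the log.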
Update:
So if some 'scores.json' files are missing, the average is computed only over the scores files that exist. Episodes that have no scores file are simply left out of the count.
Which is bad, right? This is a description of the (source of the) problem.
I added a test workflow that runs sherzod's script, plus a result badge in the README, for https://github.com/clembench/clembench-runs
Adding what I know here, for @sherzod-hakimov to fill in later: apparently the computed scores are not reliable if some episodes have no transcripts to score (which can happen if the API rejects the queries). In such cases, the computed score is unrealistically high. Instead, the score should be NaN, to alert us to the fact that transcripts are missing.
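To make the difference concrete, here is a minimal sketch comparing the two averaging behaviours. It is not the actual evaluation script: the function names are made up, and a missing scores file is represented simply as `None` in a list of per-episode scores.

```python
import math


def naive_mean(scores):
    """Average only the scores that exist, silently skipping missing ones."""
    present = [s for s in scores if s is not None]
    return sum(present) / len(present) if present else math.nan


def strict_mean(scores):
    """Return NaN as soon as any episode has no score (None)."""
    if not scores or any(s is None for s in scores):
        return math.nan
    return sum(scores) / len(scores)


# Four episodes, two of which produced no scores file (hypothetical values).
episodes = [100.0, 100.0, None, None]
print(naive_mean(episodes))   # 100.0
print(strict_mean(episodes))  # nan
```

With the naive version, the two failed episodes vanish and the result looks perfect. With the strict version, the NaN propagates into any downstream table and flags that something is missing.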