clp-research / clembench

A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark

[scoring] ensure that all scores use the full range 0:100 #58

Open davidschlangen opened 4 months ago

davidschlangen commented 4 months ago

At the moment, we are using the aggregated accuracy of the reference game as its experiment score. But this has a chance baseline of 33% (one in three, a random guess). Ideally, we'd use something like max(0, 100 * (accuracy - per_chance) / (100 - per_chance)) to re-scale the score to the full 0:100 range.
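
For concreteness, a minimal sketch of that rescaling in Python (the function name and the handling of the chance level as a percentage are assumptions for illustration, not part of the framework):

```python
def chance_adjusted_score(accuracy: float, chance: float = 100 / 3) -> float:
    """Rescale an accuracy in [0, 100] so that chance-level performance
    maps to 0 and perfect performance maps to 100. Scores below the
    chance baseline are clipped to 0."""
    return max(0.0, 100 * (accuracy - chance) / (100 - chance))

# For the reference game with its 1-in-3 baseline:
# chance_adjusted_score(100.0) -> 100.0   (perfect)
# chance_adjusted_score(33.3)  -> 0.0     (random guessing, clipped)
# chance_adjusted_score(66.7)  -> ~50.0   (halfway between chance and perfect)
```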

But the problem is that, the way things are set up at the moment, the scorer only looks at individual episodes (which here yield a binary score), and bencheval aggregates -- but in a game-agnostic way.

So it looks like we are left with two not-great alternatives: leave things as they are (and thereby artificially lift the scores of all models, since every model gets at least 33 on this game, unless it manages to anti-correlate with the truth...), or make bencheval work game-specifically (sketched below).
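
To illustrate what the second alternative could look like, here is one possible shape for a game-aware normalization step. All names here are hypothetical; this is not the actual bencheval interface, only a sketch of the idea:

```python
from typing import Callable, Dict

# Hypothetical: per-game normalizers keyed by game name. A game-agnostic
# aggregator could look up the game here after averaging episode scores,
# falling back to the identity for games whose scores need no rescaling.
GAME_NORMALIZERS: Dict[str, Callable[[float], float]] = {
    # Reference game: 1-in-3 random-guess baseline.
    "referencegame": lambda acc: max(0.0, 100 * (acc - 100 / 3) / (100 - 100 / 3)),
}

def normalize(game: str, aggregated_score: float) -> float:
    """Apply a game-specific rescaling if one is registered,
    otherwise pass the aggregate through unchanged."""
    return GAME_NORMALIZERS.get(game, lambda s: s)(aggregated_score)
```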

Need to come back and think about this more.

davidschlangen commented 2 months ago

Of course, hidden behind the max(0, ..) is the fact that some weird models might anti-correlate with the truth (which apparently is indeed happening) and reach worse-than-chance performance. That should give them a negative score, which the max hides.
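
To make that concrete with the 1-in-3 baseline (same formula as above, with the clipping removed):

```python
chance = 100 / 3
accuracy = 20.0  # an anticorrelated model, worse than random guessing
raw = 100 * (accuracy - chance) / (100 - chance)
print(raw)       # -> -20.0; max(0, raw) would flatten this to 0,
                 # indistinguishable from merely-at-chance performance
```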