clp-research / clembench

A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark
MIT License

[framework] Make it easier to select instances that are going to be used -- for example, different languages #36

Closed davidschlangen closed 3 months ago

davidschlangen commented 5 months ago

(This is related to #12 , but looks at it from a different angle.)

Going forward, we want to make the benchmark multilingual. For this, we would need a general mechanism through which at runtime we can determine which instances of a game are run -- where these instances could happen to be in a language other than English.

That is, it should be possible to say something like v2.0_en, which runs a certain set of instances, or v2.1_en, which runs another one (that is what #12 is asking for), but also v2.0_de which runs the German version of v2. If we've set up things correctly so far, this really should only concern the set of instances that is run, and the game code itself (updating the game state, scoring, determining when the game is done) should be language agnostic.

phisad commented 4 months ago

Add an option -i to the CLI command to select the instances.json to be used. The option specifies a file name (without the .json suffix), which could be instances.v1.5 or instances.v1.5_en, and defaults to instances.
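For illustration, a minimal argparse sketch of such an option (the argument wiring and variable names here are assumptions, not the actual clembench CLI code):

```python
# Hedged sketch of the -i option; parser setup is illustrative only.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "-i", "--instances_name",
    default="instances",
    help="Name of the instances file (without .json), "
         "e.g. instances.v1.5 or instances.v1.5_en")
args = parser.parse_args()

# The game would then load <game_dir>/in/<instances_name>.json
instances_file = f"{args.instances_name}.json"
```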

phisad commented 4 months ago

Fixed with https://github.com/clp-research/clembench/pull/52

AnneBeyer commented 4 months ago

Making the benchmark multilingual also requires an adjustment of the results directory structure (and in turn also the transcribe/scoring/evaluation scripts, I guess), as it is currently based only on the model, the game name, and the sub-experiments, but not on the instance version. Running a game with different (language) versions would therefore overwrite the results.

sherzod-hakimov commented 4 months ago

Maybe we should add an argument -r for the results_path to the CLI (the directory name to save the results in).

sherzod-hakimov commented 4 months ago

#7 is the same issue, if what I suggested is applicable here.

davidschlangen commented 4 months ago

#7 was about the base directory for the results, if I remember correctly.

At the moment, results get written into directories like this: v1.0/claude-2.1-t0.0--claude-2.1-t0.0/imagegame/0_compact_grids. In the future, we will need an additional subdirectory for each game, like so: v1.0/claude-2.1-t0.0--claude-2.1-t0.0/imagegame/v1.0_en/0_compact_grids.

Or actually, would it not be enough to add this to the top level? So that we have something like this: v1.0_en/claude-2.1-t0.0--claude-2.1-t0.0/imagegame/0_compact_grids.

Where is the first v1.0 coming from at the moment?

In any case, I'd worry that leaving this to the CLI call via an -r option would invite mistakes. We need to guarantee consistency across games in the naming of this. And the name depends on the name of the instances. So maybe the place to enforce consistency is there (and only once)?

davidschlangen commented 4 months ago

Yeah, so, maybe not on the top level?

A version of the benchmark (the top-level v1.0) may in the future comprise many languages, which we however want to keep separated in the results directory. So the semantics would be something like this:

BENCHMARK_VERSION/MODEL-PAIR/GAME/INSTANCE_VERSION_LANGUAGE/SUBEXPERIMENT.

And the instance version may or may not be different from the benchmark version. So maybe we add v2.4 of some game's instances to benchmark v2...

Or do we want even more levels?

BENCHMARK_VERSION/MODEL-PAIR/GAME/LANGUAGE/INSTANCE_VERSION/SUBEXPERIMENT.

Even more? DAY_OF_THE_WEEK???

phisad commented 4 months ago

A question now would be whether we want to score and transcribe everything, or whether a specific instances version should be (optionally) selectable here as well? Like the -i option during the run?

davidschlangen commented 4 months ago

I guess making this optionally selectable can't hurt, in cases where someone is experimenting with instances. But scoring is not a destructive operation, right, so this at best just saves time that would be wasted on scoring things that have already been scored. At the cost of some complexity in the code?

davidschlangen commented 4 months ago

I have realised that we had already talked about and potentially solved the main (conceptual) issue here. Citing from the (internal) versioning document (which maybe I should move over here?):

The set of game realisations (i.e., prompts + game logic) defines the major version number. That is, any time the set of games is changed, a new version number is required -- and all included models need to be re-run against the benchmark. Result scores are generally not expected to be comparable across major versions of the benchmark.

The set of instances of each realisation defines the minor number. E.g., when new target words for the wordle game are selected, then this constitutes a new minor version of the benchmark. Result scores are likely comparable across minor version updates, although whether this really is the case is an empirical question that needs more thorough investigation. Assuming it is, this means that a minor update does not necessitate a full re-run of the benchmark against all models.

In the future, we may want to work with multilingual game realisations. In that case, the monolingual version of the benchmark will be denoted by the suffix en (e.g., v1.0en), and the multilingual version by ml (e.g., v1.0ml). If results are broken down by language, the ISO 639-1 identifier is attached (v1.0de).

Maybe this is actually the cleanest solution. On the level of keeping the results, we put the distinctions in the top-level directory. Each version of the benchmark covers only one language, and one set of instances. So the structure would be: BENCHMARK_VERSION_LANGUAGE/MODEL-PAIR/GAME/SUBEXPERIMENT.

If the set of instances changes, the minor version of the benchmark version changes. But inside of the results, there is no further subdivision, and it looks like before.

This creates the notion of the "multilingual super benchmark", which collects various language-specific benchmarks. For example, mling-clembench-2.0 could collect and aggregate all results from v2.0_eng, v2.0_deu, etc.

What's the upshot of this?

This puts the responsibility of ensuring consistency on whoever runs the benchmark on a model. They must make sure the input instances match the output root. But we can make that easier by establishing a convention: instances must be in directories that correspond to the benchmark version. So the call always looks like this: --instances v1.3_en --results_root v1.3_en.... Which would suggest that results_root can be inferred from instances.
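A minimal sketch of that inference (the helper name and the exact naming convention are assumptions, not the behaviour that was eventually merged):

```python
# Sketch only: derive the results root from the instances identifier so the
# two can never go out of sync. Handling of an "instances" prefix reflects
# file names used later in this thread (e.g. instances_v1.5_en.json) and is
# an assumption.
def infer_results_root(instances_name: str, default: str = "results") -> str:
    name = instances_name.removesuffix(".json")
    name = name.removeprefix("instances").lstrip("_.")
    # "v1.3_en"                 -> "v1.3_en"
    # "instances_v1.5_en.json"  -> "v1.5_en"
    # plain "instances"         -> default ("results")
    return name or default
```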

Ok. It looks like we (or rather I) have been running in circles. It's actually all pretty straightforward?

AnneBeyer commented 4 months ago

Coming back to this: So the main thing we need to change for the suggestion above would be the results_root, which is currently hard coded in file_utils.py (line 15) and will need to be set according to the instances (with results as the default if the input is just instances.json to make this non-breaking).

evaluation/bencheval.py already has an option to specify the results directory, but this also requires transcribe and score to know about the version (a.k.a. the results_root). I got lost when trying to find where that should be adapted. @phisad, could you look into this, please?

(We talked about just adding different language versions as sub-experiments, but this complicates things if an experiment already has different sub-experiments (which most of them do): we would then need to adapt the scoring to group them together by language again, and running a separate language would require running all experiments that contain the given language identifier. That could be the more complicated solution, depending on how difficult the transcribe/score adaptations turn out to be...)

phisad commented 4 months ago

So @briemadu and I had the idea to simply introduce a secondary level that covers the instance suffix, e.g. results/en_v1.5 for instances_en_v1.5. This is good because the eval script basically takes everything under results using glob, without caring too much about the sub-tree.
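A quick illustration of why the extra level is harmless for evaluation (the layout and the file name "scores.json" are assumptions made for this sketch, not necessarily what clembench writes):

```python
# Hedged sketch: a recursive glob picks up score files at any depth, with or
# without an en_v1.5-style sub-directory in between.
import glob

score_files = glob.glob("results/**/scores.json", recursive=True)
```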

(The language thingy was only related to German prompts -- de_v1.5, so to speak -- with English words, which could be an experiment of its own.)

davidschlangen commented 4 months ago

So the idea is that we have results/LANGUAGE-CODE_BENCHMARK-VERSION/MODEL-PAIR/GAME/SUBEXPERIMENT, right? Or why don't we go for results/BENCHMARK-VERSION/LANGUAGE-CODE/MODEL-PAIR/GAME/SUBEXPERIMENT? Then everything belonging to one benchmark version is neatly packed together, and within that, everything belonging to the same language.

Oh and yes, let's determine now that we use ISO 639-1 (two letter codes) for the language code.

phisad commented 4 months ago

That sounds fine. Now we could either loosely align the instance selection by specifying compatible suffixes, e.g. instances_<iso>_<version>.json, OR we also adjust the instance selection so that we have the same substructure there: in/VERSION/LANG/instances.json, with the CLI options becoming -v (version) and -i (iso_code).
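A small sketch of the second variant; the helper and the option semantics are illustrative assumptions, not taken from the actual code:

```python
# Sketch of the in/VERSION/LANG/instances.json layout.
import os

def instances_path(game_dir: str, version: str, iso_code: str) -> str:
    # e.g. instances_path("imagegame", "v1.5", "en")
    #   -> "imagegame/in/v1.5/en/instances.json"
    return os.path.join(game_dir, "in", version, iso_code, "instances.json")
```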

By the way, the default behavior could be to only create the sub-directories if version/language information is actually given.

OR we could simply add placeholders, e.g. results/default/default/.

phisad commented 3 months ago

OK, we simplify here. The user can continue to use the -i option to select the instances, e.g. -i instances_v1.5_en.json.

In addition, the user can specify a results directory structure using -r, e.g. -r v1.5/en, so that all the results go into results/v1.5/en. The user has to specify this manually. If -r is not specified, the directory will be results.

In addition, the user might specify an absolute path, e.g. -r /a/place/on/my/machine, which actually solves https://github.com/clp-research/clembench/issues/7
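A hedged sketch of how the -r handling could behave under this scheme (the helper name and exact resolution rules are assumptions, not necessarily the merged implementation):

```python
# Sketch only: resolve the -r value as described above. Relative values are
# nested under results/, absolute paths are used as given, and omitting -r
# keeps the plain results/ directory.
import os
from typing import Optional

def resolve_results_dir(results_arg: Optional[str] = None) -> str:
    if results_arg is None:
        return "results"                           # default, as before
    if os.path.isabs(results_arg):
        return results_arg                         # e.g. /a/place/on/my/machine
    return os.path.join("results", results_arg)    # "v1.5/en" -> "results/v1.5/en"
```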

davidschlangen commented 3 months ago

Note: Despite there being (or going to be) defaults, we never want results of official runs that do not have a benchmark version identifier and a language identifier.

phisad commented 3 months ago

Fixed with https://github.com/clp-research/clembench/pull/66