clp-research / clembench

A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark
MIT License

[framework] game registry #99

Open davidschlangen opened 3 months ago

davidschlangen commented 3 months ago

Instead of making assumptions about where the code for a game has to live in order to run the game via cli.py, we could introduce a game_registry.json (in the style of the model_registry.json) that is the only thing that needs to live in a known location. (Suggestion by Sherzod.)

The entries for each game in there could then point to a directory where the code is, besides holding all kinds of other information about the game that could be useful for whatever consumer. E.g., image: none | single | multiple, etc. etc. The scripts for running a benchmark could then filter on these entries, and automatically pull together what is needed.

But mostly, this would make it possible to move the game code outside of the main repository. We could still make default assumptions (for example, that it lives in a sister directory, e.g. if this is cb-code/clembench, it lives in cb-code/clemgames, and entries in the registry could have a relative path ../clemgames/GAME), but with our usual mechanism, these could be overwritten or adapted, if you want to put games elsewhere.
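A minimal sketch of how such a registry could be read, assuming it is a JSON list of entries with hypothetical "name" and "code" fields (the field names are not settled here). Relative paths are resolved against the directory that holds the registry file, so ../clemgames/GAME ends up pointing into a sister directory:

```python
import json
from pathlib import Path


def load_game_registry(registry_file):
    """Load registry entries and make every code path absolute.

    Assumption: the file holds a JSON list of dicts, each with a
    "name" and a "code" path (hypothetical field names).
    """
    registry_file = Path(registry_file).resolve()
    entries = json.loads(registry_file.read_text(encoding="utf-8"))
    for entry in entries:
        code_path = Path(entry["code"])
        if not code_path.is_absolute():
            # Relative paths are interpreted relative to the registry file.
            code_path = (registry_file.parent / code_path).resolve()
        entry["code"] = str(code_path)
    return entries
```

With this, only the location of the registry file has to be known; everything else follows from the entries.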

davidschlangen commented 3 months ago

Any (idle, parental leave adequate) thoughts on this, @phisad ? Would that require dramatic changes to the framework? Or is it rather something that the eval scripts would need to care about, @briemadu ?

briemadu commented 3 months ago

The eval scripts rely on the structure of results/. So as long as the framework keeps saving the results the same way, nothing would have to change in the eval for implementing this change.

phisad commented 3 months ago

You can adjust the games loading code here:

https://github.com/clp-research/clembench/blob/c6a4546c5786b03d2e1626944726596bc412210d/clemgame/__init__.py#L37-L47

davidschlangen commented 4 weeks ago

Thinking a bit more about this.

What do we want to achieve?

Thinking about this from the perspective of cli.py, we'd want a mechanism that is similar to the model registry mechanism.

What could an entry of the registry look like?

{
  "name": "taboo",
  "code": "../clp-games/taboo",
  "collections": [2.0]
}

How are locations denoted? Would it be enough to have the convention that game code by default lives in sister directories? So at least in the official game_registry.json, entries always point to ../something. We can also have a game_registry_custom.json that people can use while developing their stuff; there they can put absolute paths if they want to.
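One way the two files could be combined, as a sketch only: the custom entries are placed before the official ones, so that a first-match lookup prefers them. That ordering is an assumption, not something decided in this thread, and the helper names are hypothetical:

```python
import json
from pathlib import Path


def read_entries(path):
    """Return the list of entries in a registry file, or [] if it is missing."""
    path = Path(path)
    if not path.is_file():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def combined_registry(directory):
    """Custom entries first, so a first-match lookup finds them first."""
    directory = Path(directory)
    custom = read_entries(directory / "game_registry_custom.json")
    official = read_entries(directory / "game_registry.json")
    return custom + official
```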

What I am not sure about is how to realise the capability of selecting by properties, or what it even means... I thought it could be nice to have this be a list, so that the same game can be part of more than one collection. But that doesn't work with the current unification mechanism at least. Also it doesn't really work with the way things are set up at the moment, because our version numbers do not only reference changes in instances, but can also reference changes in code. So to reconstruct a particular version of the benchmark, one needs to make sure that the code that is being linked to is at the right version.

Aha. Maybe this could work:

{
  "name": "taboo",
  "c-id": "v1.5",
  "code": "../clp-games-v1.5/taboo"
}

This means that for reproducibility purposes, if one wants to re-run an older benchmark, one needs to check out the games repository at a particular revision, then rename it to conform with this, and can then call -g '{"c-id": "v1.5"}', which would unify with all game specs that match this. So this is actually elegant. (Although it would mean that each (benchmark-)version of a game needs its own entry. But so be it.) The consequence would be, though, that each game can be in at most one collection, and games that aren't in any need to specify something like {"c-id": "none"}.

Alright. Looks like this wouldn't be too dramatic a change. The game loading / identification code needs to change, but hopefully the rest can stay. (Problem might be the resource location methods?)

What are the chances that most of the code for dealing with spec jsons can be re-used, @phisad ?

AnneBeyer commented 4 weeks ago

@phisad, don't worry about it, I started looking into this here.

davidschlangen commented 4 weeks ago

Just adding a note here. This:

{
  "name": "taboo",
  "c-id": "v1.5",
  "code": "../clp-games-v1.5/taboo"
}

would not work, as it would lead to multiple entries with "name": "taboo". To ensure that the latest is found (assuming that that is the intent when one just uses the name), one would have to rely on accidental facts about the loading (the first entry that matches is used).

But I think what is valid is that older versions need to be identified via their code. The understanding would have to be that an entry in the registry denotes a combination of code and instances (because that's what's packaged up together).

So the information that a particular game has been part of several versions of the benchmark doesn't belong to a single entry, if each entry only denotes game-in-particular-version-of-benchmark.

(There are two goals here I think which only partially overlap: One is to be able to easily access collections of games, and the other is being able to reproduce older versions of the benchmark.)

AnneBeyer commented 4 weeks ago

Here's a first template for the game registry @sherzod-hakimov and I briefly discussed yesterday:

{
"game_name": game identifier
"game_path": path to game  # relative to clemgame directory (or absolute in game_registry_custom.json)
"description": "A brief description of the game"
"main_game": "main game identifier" # to cluster different versions of the same game
"player": "single" | "two" | "multi" # not sure if this will be relevant, but we decided to add it anyways
"image": "none" | "single" | "multi"
"languages": ["en"] # list of ISO codes
"benchmark": ["X.X", "Y.Y"] # lists all benchmark versions in which this game was used 

# The games that are part of a specific collection can be filtered based on the 
# game attributes.
# For reproducibility, "benchmark" will also list all benchmark versions a game has   
# been used in previously
}

My approach to the collections would then be similar to what we already discussed for the instance files (and results structure): A collection could be denoted by collection_X.X_uni/multimodal_language (where the default for the latter two is text-only and English, if not specified) and we then check if the name starts with "collection" (otherwise we just load the game by name) and filter the list of available games accordingly.

And how about addressing the reproducibility aspect by marking each version as a Release in the new game repository (assuming that we are able to mirror the "evolution" of the games there)? This still requires a manual checkout of the required version, but the model registry would not need to be changed (assuming the version above).

davidschlangen commented 4 weeks ago

Can you elaborate on this? I don't understand what collection_X.X_uni/multimodal_language would resolve to.

One desideratum that I see is to make this work as much as possible without changing the mechanism that we arrived at for the model registry; or at the very least only adding to that mechanism. (Not only because it's quite elegant, but also because we don't want to duplicate functionality.)

So when thinking about this, one element that's important to understand is that that mechanism works via unification. (At least that was the idea...) So basically, the specification (= feature structure) that is used (= used to select the backend, and then passed on to it) is the first specification that unifies with the specification (feature structure) that is given.

This makes it possible to find a fully specified specification just by giving one feature value (e.g., "name": "gpt-4"), because that will find the first one that has this value for this feature; all other feature values will then come from that entry. But it also makes it possible to extend an existing entry, by specifying a feature that it doesn't mention, and it makes it possible to block an existing entry from matching, by specifying a feature differently (and hence making what is passed on the command line the full entry).

Another nice feature is that the consumer (in the case of the model registry, the backend) will just ignore features that it doesn't care about. So this -- together with the fact that there is only a convention for the structure, and not a real schema -- makes it possible to stick additional information into the registry, that may be used elsewhere (e.g., in the model registry, the number of parameters, which is just used for creating plots).
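To make the three behaviours (find, extend, block) concrete, here is a small sketch of first-match unification over flat feature structures. It illustrates the idea as described in this comment and is not the framework's actual code; the function names and the sample values are hypothetical:

```python
def unify(query, entry):
    """Return the merged spec if query and entry are compatible, else None.

    Two flat feature structures unify when they agree on every feature
    they share; the result carries the features of both.
    """
    for key, value in query.items():
        if key in entry and entry[key] != value:
            return None
    return {**entry, **query}


def first_unifying(query, registry):
    """Return the first registry entry that unifies with the query.

    If nothing unifies, the query itself becomes the full spec.
    """
    for entry in registry:
        merged = unify(query, entry)
        if merged is not None:
            return merged
    return dict(query)


# Hypothetical registry content, for illustration only.
registry = [{"name": "gpt-4", "backend": "backend-a"}]

found = first_unifying({"name": "gpt-4"}, registry)
extended = first_unifying({"name": "gpt-4", "note": "extra"}, registry)
blocked = first_unifying({"name": "gpt-4", "backend": "backend-b"}, registry)
```

Here `found` picks up the backend from the entry, `extended` additionally carries the feature the entry did not mention, and `blocked` contains only what was passed in, because the differing backend value prevents the entry from matching.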

Something that does not yet work, and if we want it, would require some thinking, is unification into sets or lists. So selecting something that is specified as "benchmark": ["1.0", "1.5"] via "benchmark": "1.5" wouldn't (yet) work.
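One conceivable extension, sketched here only to show what "unification into lists" might mean: a scalar value in the query matches a list in the entry if the list contains it. This is an assumption about a possible design, not existing behaviour, and it leaves open what the merged value should be (the list or the scalar), which is why the sketch only answers yes or no:

```python
def value_matches(query_value, entry_value):
    """A scalar query value matches a list entry value that contains it."""
    if isinstance(entry_value, list) and not isinstance(query_value, list):
        return query_value in entry_value
    return query_value == entry_value


def unifies(query, entry):
    """True if query and entry agree on every feature they share."""
    for key, query_value in query.items():
        if key in entry and not value_matches(query_value, entry[key]):
            return False
    return True


# Hypothetical entry, for illustration only.
entry = {"game_name": "taboo", "benchmark": ["1.0", "1.5"]}
matches_15 = unifies({"benchmark": "1.5"}, entry)
matches_20 = unifies({"benchmark": "2.0"}, entry)
```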

davidschlangen commented 4 weeks ago

I think the first thing to get clarity about is what an entry in the registry is meant to specify: Is it a particular game as such (in which case it makes sense to list all benchmark versions it was involved in), or is it -- more modestly -- one directory containing game code and instances?

I'm tending towards the latter, because then, together with your proposal, we could easily capture the reproducibility use case:

{
  "game_name": "taboo-1.5",
  "game_type": "taboo",
  "game_path": "../clembench-games-1.5/taboo",
  # if you want to run this, check out release 1.5 of the clembench-games repository,
  # rename it to clembench-games-1.5, and place it in the parent directory of this one
}

(This would mean however that with each new release, we need to add a batch of entries like the one above. The "unmarked" entry ("game_name": "taboo", "game_path": "../clembench-games/taboo") would always denote the current one.)

davidschlangen commented 4 weeks ago

But in general, I like the idea of putting all of that information in there. It should be possible to ask questions like "what are all games (game directories) that use multiple images?" or "what are all games (game directories) for which Spanish instances exist?".
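Questions of this kind could be answered with a plain filter over the entries. A sketch, assuming the "image" and "languages" fields from the template above; the helper names are hypothetical:

```python
def games_where(registry, predicate):
    """Return the names of all entries for which predicate(entry) is true."""
    return [entry["game_name"] for entry in registry if predicate(entry)]


def uses_multiple_images(entry):
    """True for entries marked as using more than one image."""
    return entry.get("image") == "multi"


def has_language(code):
    """Build a predicate that checks for instances in the given language."""
    return lambda entry: code in entry.get("languages", [])
```

For example, `games_where(registry, has_language("es"))` would list all game directories for which Spanish instances are registered.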

AnneBeyer commented 4 weeks ago

To answer your question (now a bit further) above: This could for example resolve to collection_1.5 (which would select all text-only English games marked for version 1.5) or collection_1.5_multimodal (which would select all English games marked as using images and marked for version 1.5) or collection_1.5_ru (which would select all Russian games marked as text-only for version 1.5) or collection_1.5_multimodal_ru (which would select all Russian games marked as using images and marked for version 1.5)
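A sketch of how such names could be resolved, assuming the fields from the template above ("game_name", "image", "languages", "benchmark") and the stated defaults (text-only, English). The function names are hypothetical, and any suffix other than "unimodal" or "multimodal" is simply treated as a language code:

```python
def parse_collection(name):
    """Split e.g. "collection_1.5_multimodal_ru" into filter criteria.

    Returns None if the name does not denote a collection.
    """
    parts = name.split("_")
    if parts[0] != "collection" or len(parts) < 2:
        return None
    spec = {"benchmark": parts[1], "multimodal": False, "language": "en"}
    for part in parts[2:]:
        if part == "multimodal":
            spec["multimodal"] = True
        elif part == "unimodal":
            spec["multimodal"] = False
        else:
            spec["language"] = part
    return spec


def select_games(name, registry):
    """Return all games of a collection, or the games with the given name."""
    spec = parse_collection(name)
    if spec is None:
        return [g for g in registry if g["game_name"] == name]
    return [
        g for g in registry
        if spec["benchmark"] in g["benchmark"]
        and spec["language"] in g["languages"]
        and (g["image"] != "none") == spec["multimodal"]
    ]
```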

I did actually understand the unification approach and will look into how this can be used for the game registry once I have a working version for loading the games interactively in general.

davidschlangen commented 4 weeks ago

But how? Where does one specify what collection_1.5 refers to? It's not using the unification mechanism / it's not "give me all specs (rather than just the first) that unify with this partial description", or is it?

AnneBeyer commented 3 weeks ago

As far as I understood, the use case is a bit different here than with the backends. I thought we want to be able to select a collection, i.e., all specs that satisfy a given description (such as "benchmark" containing the given version number or "language" containing a specific language, or even "main_game" containing a specific identifier, such as "wordle", to run all available variants of a game). Only when a specific game name is given do we want to retrieve just the first entry for that game.

davidschlangen commented 3 weeks ago

Come to think of it, I would place this difference elsewhere, so as to not break the mechanism. Maybe something like scripts/cli.py run -g '{"game_name": "my-game"}' and scripts/cli.py run -g my-game take the first that matches (i.e., same behaviour as model registry), while scripts/cli.py run -G '{"in-benchmark": "2.0"}' matches all entries that unify with this.
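A sketch of how the two options could behave, using argparse. This is not the code of scripts/cli.py; the `-G` option is only the proposal made here, and `to_spec`, `unifies` and `select` are hypothetical helpers:

```python
import argparse
import json


def to_spec(text):
    """Accept either a bare game name or a JSON object as a partial spec."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return {"game_name": text}
    return value if isinstance(value, dict) else {"game_name": text}


def unifies(query, entry):
    """True if the entry does not contradict any feature of the query."""
    return all(entry.get(key, value) == value for key, value in query.items())


def select(args, registry):
    """-G returns every unifying entry, -g only the first one."""
    if args.all_games is not None:
        return [entry for entry in registry if unifies(args.all_games, entry)]
    for entry in registry:
        if unifies(args.game, entry):
            return [entry]
    return []


parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("-g", dest="game", type=to_spec)
group.add_argument("-G", dest="all_games", type=to_spec)
```

Keeping the first-match behaviour under `-g` leaves the existing mechanism untouched; the "all matches" behaviour is purely additive.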

davidschlangen commented 3 weeks ago

I think we need to be careful to keep game, directory containing game and instances, experiment, language of experiment, instance, etc. etc. separate, and be clear about what this mechanism should do.

YanaPalacheva commented 2 weeks ago

Hi! Barging in here with a bit of an outsider's perspective (I am working on multilingual versions of taboo and wordle).

First of all, I think it's a great step towards systematization. Personally, based on the discussion here I would add two things to the template suggested by @AnneBeyer:

First, for each game, we have three game "variables" we need to consider: version, language, game variant (e.g., vanilla wordle, wordle with clue, wordle with critic).

Version - to be honest, I don't understand the definition of version.

Initially I thought it was the version of the framework itself: so there was clembench v0.9 and there were some games relying on this implementation (say, taboo-0.9). Then clembench v1.0 came, old games' implementations were updated accordingly (taboo-1.0), old implementations were tagged/archived(?) and some new games were added. However, the only reflection of versions I see is in the suffix of some instance files (and the structure of the files is more or less similar).

Language should influence instance generation and resources only; the game logic should be language-agnostic. So, if the structure of instances is likely to change between versions, version and language should be tied together.

Game variants are different games sharing a lot of code. So I believe that while the implementations should be refactored and unified under the same parent game folder, in the game registry, each entry should reflect a specific game variant as suggested.

Second, if I understand the purpose of introducing collections correctly, I would implement them with arbitrary tags and then programmatically filter/combine games based on their tags (and probably define collections as a combination of tags and other parameters (versions, langs, etc.) based on the registry).

So the registry entry can follow this structure.

{
"game_name": mandatory, game identifier (resolving game variant),
"game_path": mandatory, path to game code
"description": "A brief description of the game"
"main_game": main game identifier (umbrella name)
"player": "single" | "two" | "multi"
"image": "none" | "single" | "multi"
-----------------------------------------
"tags": ["text-only" | "multimodal"  | ... , "e.g. spatial", ...] # allowing to filter and combine games based on arbitrary tags

"version": { # lists all benchmark versions in which this game was used 
    "v0.9": {
        "langs": ['en']
    },
    "v1.0": {
        "langs": ['en', 'es']
    },
    "v1.5": {
        "langs": ['en', 'ru']
    },
    ...
}
}

YanaPalacheva commented 2 weeks ago

Another high-level thing I would like to discuss is the approach to keep the framework and the games in one place. I think it would be way easier to maintain the whole setup if the two things were separated: one repo with the core framework logic and games in separate repos.

In this case, each game repo would contain a descriptor file similar to the game registry entry and the registry would look like a key-value map like game_name: repo_link. If that makes sense, I can create a duplicate of, say, taboo in a separate repo and see how it would work in connection to this framework repo as an example.

davidschlangen commented 1 week ago

I'm a little bit worried about overloading this. We should identify a main purpose, maybe something like "the primary function of an entry is to map between identifying characteristics and a directory". I think what's written above is compatible with that, but it would be good to keep that in mind.

With that said, and inspired by your last comment, @YanaPalacheva (although I don't fully understand it), maybe we could even do something that is a bit huggingface-like and allow for the path to be a repository identifier? The maximal solution here would then be to even automatically check out the repo, if it isn't in ../.

Or, more modestly, we could just have the understanding that the path is to a repository. Either way we could then add to the version description something like "release_tag": "0.5", i.e., a reference to a git release tag. (Or a hash.)
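For the "maximal solution", a sketch of what the automatic checkout could amount to, assuming an entry with a hypothetical "repo" field next to "game_name" and "release_tag". The function only builds the git command and leaves running it to the caller; it relies on `git clone --branch` accepting a tag name, so it covers the release-tag case but not a bare hash:

```python
from pathlib import Path


def checkout_command(entry, parent_dir):
    """Return the git command that fetches the game repo, or None if present.

    Assumption: the repository would be cloned into parent_dir/<game_name>.
    """
    target = Path(parent_dir) / entry["game_name"]
    if target.exists():
        return None
    return [
        "git", "clone", "--depth", "1",
        "--branch", entry["release_tag"],
        entry["repo"], str(target),
    ]
```

The returned list can be passed to `subprocess.run` once it is clear that an automatic checkout is actually wanted.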

AnneBeyer commented 1 week ago

First step: Create a minimal working version that still allows calling games by name but where the games can live anywhere outside this repository (as specified in the game registry) (this is almost done and only needs to be finalized and tested)

Second step: spice up the ls command to give more details/allow filtering of games

Third step: think a bit more about/finalize requirements for using model registry to collectively run specific sets of games

AnneBeyer commented 1 week ago

This actually also overlaps with #62