clp-research / clembench

A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark
MIT License
19 stars 26 forks source link

backends should expose a method for getting size in tokens / max context size violation #22

Closed davidschlangen closed 2 months ago

davidschlangen commented 7 months ago

Some games can run into the danger of producing contexts that are bigger than what a given model allows. To systematically check for that, backends should expose one standard method for checking for this. (Perhaps even expressed in percentage of max context size, so that information about the context size can stay in the backend.)

Gnurro commented 7 months ago

I think a percentage is too arbitrary to be useful here, as its meaning can change drastically between models/tokenizers. If context management is left to the game side, I think passing remaining context space, as in the number of tokens that can still be added, is more useful.

Gnurro commented 7 months ago

I've made a demo/prototype for doing this for HF transformers-based backends: https://github.com/Gnurro/clembench/blob/hf_context_handling/backends/hf_context_exp.py

davidschlangen commented 7 months ago

This looks good -- but seems to be a separate class? My idea was that this is just a method of the backend (like generate()) [there was a typo in the title of this issue, it was supposed to say "expose a method"]. Is there anything that would argue against that?

Gnurro commented 7 months ago

The method can be added to the existing backend code and would work the same. I just made the demo class to test and show the functionality, without needing to load a full model. It's on a branch of my fork for convenient access, not to be pulled into the main repo. I also want to add some implementation tests/ideas for dialog/messages checking to it, which would be hard to show and discuss if they were already integrated into existing backend code.

Gnurro commented 5 months ago

The externally callable method is done in the local huggingface backend overhaul (https://github.com/Gnurro/clembench/commit/f87b383b6d7b7f1911330f4051cce947532988bd ). I've also added a class-internal method that is called for each response generation: https://github.com/Gnurro/clembench/blob/ceec70c96ac7e52b8e2eedcb30120a1b0aeefa49/backends/huggingface_local_api.py#L311-L315 What should the generation method return if this check fails? I assume this should fail more gracefully than before. With the old backend, exceptions/CUDA OOMs due to exceeding context limits just cancel running the current episode, leaving incomplete run records. (This is just to make perfectly sure that there's no issues - future versions of existing clemgames and new ones should be made using the external checking method in the first place.)

davidschlangen commented 5 months ago

Sounds good. Let me think about it. Intuitively, I would think that this is something for the games themselves to handle (in a possible v2.0), because it may be different for each game what it wants to do to create a shorter context, which turns can be removed. (This is of course orthogonal to what the backend does when a call is made without a prior check. Maybe just log this information, so that the eval script can do something with this information?)

davidschlangen commented 5 months ago

A comment about the implementation: the public method should call the private one. At the moment, it duplicates the code of the private one.

Gnurro commented 5 months ago

I think a good solution would be to raise a specific exception for the clemgame to catch and handle. Another (potentially additional) option is passing the context check return tuple to the clemgame, since that contains the token counts and limit, so that the clemgame can handle this more dynamically. I'm unsure if it should be passed the same way as proper responses are returned, though.

Gnurro commented 5 months ago

A comment about the implementation: the public method should call the private one. At the moment, it duplicates the code of the private one.

Done: https://github.com/Gnurro/clembench/commit/d8b9129d13e56a507d41067697857458a8e2cab1

Gnurro commented 5 months ago

I've implemented it using a specific exception for now: https://github.com/Gnurro/clembench/commit/d6ce06c4124e44c6a41ea969995c8c31b3d139a9 I think having it as an exception like this also emphasizes the need to handle this on the clemgame side, and the exception instance itself has the necessary token counts. Catching and handling the exception inside the clemgame code should also prevent the record issues, as the exception is raised before the generation method returns anything. The logger does make an entry about it for clembench.log, but the evaluation scripts and other parts of the framework shouldn't break anymore as this should prevent the creation of malformed records.

phisad commented 5 months ago

Looks good. Is this check applicable (or necessary to apply) to other backends as well?

Gnurro commented 5 months ago

Looks good. Is this check applicable (or necessary to apply) to other backends as well?

I think it would be good to have this for other backends as well, even if their (remote) API handles the limits automatically - it lets clemgame developers control which messages are given to the model instead of having those backends do it arbitrarily.

The ContextExceededError is supposed to go into backends/_init_.py later on, so it can be used by all backends. And there could be a try clause for it in the Player/clemgame base classes to align the handling of it for all clemgames.

Gnurro commented 5 months ago

ContextExceededError implemented with https://github.com/clp-research/clembench/pull/37

phisad commented 5 months ago

Note to me: Functionality is there in huggingface_local_api.py which needs to be moved to init or even better a backends.utils.py

davidschlangen commented 2 months ago

@phisad , can be closed?

phisad commented 2 months ago

Can be closed because implemented with https://github.com/clp-research/clembench/pull/81