clp-research / clembench

A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark
MIT License
26 stars 34 forks source link

[feature] Enable easy re-prompting mechanism in DialogueGameMaster #41

Closed lpfennigschmidt closed 9 months ago

lpfennigschmidt commented 10 months ago

When using the DialogueGameMaster there is currently no easy way to reprompt a model on an invalid response. The fix is to extract the prompt function into its separate function and introduce two more hooks into the framework that do not break any other games:

The prompting mechanism is extracted into a separate function:

class DialogueGameMaster(GameMaster)
    [...]
    def prompt(self, player: Player):
        # GM -> Player
        history = self.messages_by_names[player.descriptor]
        assert history, f"messages history must not be empty for {player.descriptor}"

        last_entry = history[-1]
        assert last_entry["role"] != "assistant", "Last entry should not be assistant " \
                                                    "b.c. this would be the role of the current player"
        message = last_entry["content"]

        action = {'type': 'send message', 'content': message}
        self.log_event(from_='GM', to=player.descriptor, action=action)

        _prompt, _response, response_message = player(history, self.current_turn)

        # Player -> GM
        action = {'type': 'get message', 'content': response_message}
        self.log_event(from_=player.descriptor, to="GM", action=action, call=(_prompt, _response))

        # GM -> GM
        self.__validate_parse_and_add_player_response(player, response_message)

We add two hooks whether re-prompting should be done and enabling a message to be added before reprompting:

class DialogueGameMaster(GameMaster)
    [...]
    def _should_reprompt(self, player: Player):
        return False

    def _on_before_reprompt(self, player: Player):
        """
        Hook

        Change the prompt to reprompt the player on e.g. an invalid response.
        Add the new prompt to the players message via self.add_user_message(player, new_prompt)

        :param player: that produced the invalid response
        """
        pass

Then the play-function becomes this:

class DialogueGameMaster(GameMaster)
    [...]
    def play(self) -> None:
        self._on_before_game()
        while self._does_game_proceed():
            self.log_next_turn()  # not sure if we want to do this always here (or add to _on_before_turn)
            self._on_before_turn(self.current_turn)
            self.logger.info(f"{self.name}: %s turn: %d", self.name, self.current_turn)
            for player in self.__player_sequence():
                if not self._does_game_proceed():
                    break  # potentially stop in between player turns
                self.prompt(player)
                while self._should_reprompt(player):
                    self._on_before_reprompt(player)
                    self.prompt(player)
            self._on_after_turn(self.current_turn)
            self.current_turn += 1
        self._on_after_game()

🎉

phisad commented 10 months ago

I like the idea and the contribution. Could you open a PR? We can have this merged quite soon I guess as it looks backwards compatible. The code looks cleaner and the main loop is easier to understand by pushing the functionality into self.prompt().

phisad commented 10 months ago

We might want to mark the reprompting in the logs. So maybe provide an optional argument to change the event action type. e.g. "send message (reprompt)". Could you test if this works for the transcripts?

Gnurro commented 10 months ago

We might want to mark the reprompting in the logs. So maybe provide an optional argument to change the event action type. e.g. "send message (reprompt)". Could you test if this works for the transcripts?

Yeah, this should definitely be in the logs and be accessible for scoring. Models 'getting it right' on the first try is very impactful for end users, and to my knowledge there is no other benchmark that tracks this concisely (at least at the intricate level clembench allows for).

lpfennigschmidt commented 10 months ago

Yup, will do a PR on Monday, just need to figure out how to :) I have to open one from my fork, right?

Gnurro commented 10 months ago

Yup, will do a PR on Monday, just need to figure out how to :) I have to open one from my fork, right?

Yes, and if the fork is set up properly, it should be straight-forward by clicking the button on GH.

phisad commented 9 months ago

Fixed with https://github.com/clp-research/clembench/pull/42