clp-research / clembench

A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark
MIT License

private/shared should ensure proper turn taking (assistant / user / assistant) in messages #24

Closed davidschlangen closed 8 months ago

davidschlangen commented 10 months ago

Assume that the `generate()` method requires alternating turns between assistant and user.

private/shared has reasons to deviate from that, but should normalise to that format before calling `generate()`.
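For illustration, the strictly alternating format that `generate()` is assumed to expect would be a messages list like this (hypothetical game content, just to show the shape):

```python
# Strictly alternating roles, ending on a user message that prompts a reply:
messages = [
    {"role": "user", "content": "You are playing a guessing game. Ready?"},
    {"role": "assistant", "content": "Ready."},
    {"role": "user", "content": "What is your first guess?"},
]
```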

Gnurro commented 10 months ago

This comes up with other games as well.

The need to have this strict order of messages comes from the Llama2 chat format: in this format, user and assistant messages are always paired, and these pairs are delimited, not single messages. A user+assistant message pair is formatted like this: `<s>[INST]{user message content}[/INST]{assistant message content}</s>`. Prompting for a reply leaves the assistant half out: `<s>[INST]{user message content}[/INST]`. Here, `<s>` is the BOS token and `</s>` the EOS token.

The consequence of this is that the passed messages list must strictly follow the order user/assistant/user/assistant/.../user. In words:

- It must always start with a user message.
- It must always follow the strict order of user+assistant message pairs, except for the last message.
- It must always end with a user message.

EDIT: The current Llama2 and HuggingFace backends flatten consecutive same-role messages to conform to this order, but do so indiscriminately, which fixes at least that flaw (in a brute-force way). However, since I did not want to interfere with the games more than that, the start and end of the messages list are not changed in any way.
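For concreteness, a minimal sketch of what such indiscriminate flattening could look like (the function name and details are my own; the actual backend code may differ):

```python
def flatten_messages(messages: list[dict]) -> list[dict]:
    """Merge consecutive same-role messages into a single message.

    Brute-force normalisation: it restores alternation within the list,
    but deliberately does not fix a list that starts or ends with the
    wrong role.
    """
    flattened = []
    for message in messages:
        if flattened and flattened[-1]["role"] == message["role"]:
            # Same role as the previous message: merge instead of appending.
            flattened[-1]["content"] += "\n" + message["content"]
        else:
            flattened.append(dict(message))  # copy to avoid mutating input
    return flattened
```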

davidschlangen commented 10 months ago

That suggests a very simple temporary fix to at least let the games run through: change the template so that it will always add the `</s>` token.

Gnurro commented 10 months ago

> That suggests a very simple temporary fix to at least let the games run through: change the template so that it will always add the `</s>` token.

How would that help with an assistant message at the start?

davidschlangen commented 10 months ago

Alright: this suggests a very simple temporary fix to at least let the games run through: change the template so that it will always add paired `<s>` and `</s>` tokens (because that is something we can guarantee). Also, change it so that it does not make any assumptions about whether there actually is any user or assistant message content.

Actually, that could even be the actual fix: make all templates agnostic as to turn taking. This leaves it up to the game designer to decide whether it is worth the risk of putting some models at a disadvantage by producing contexts that may not look exactly like their training data.

The downside is that we would need to check all templates and could no longer rely on the model providing its own template.

Gnurro commented 10 months ago

I'm not sure if I understand the temporary fix for the Llama2 format/template correctly. I'll go step by step.

> change the template so that it will always add paired `<s>` and `</s>` tokens (because that is something we can guarantee).

Do you mean to wrap each message with `<s>` and `</s>`, as in `<s>{any message content}</s>`, and add them to the context individually? To change the user+assistant pair format from `<s>[INST]{user message content}[/INST]{assistant message content}</s>` to `<s>[INST]{user message content}[/INST]</s><s>{assistant message content}</s>`? Or to remove the user-input/instruction delimiters as well, making it `<s>{user message content}</s><s>{assistant message content}</s>`?

> Also, change it so that it does not make any assumptions about whether there actually is any user or assistant message content.

So in the case of a user message that is not followed by an assistant message, instead of `<s>[INST]{user message content}[/INST]{assistant message content}</s>`, add `<s>[INST]{user message content}[/INST]</s>` to the context? And in turn, for an assistant message that is not preceded by a user message, instead of `<s>[INST]{user message content}[/INST]{assistant message content}</s>`, add `<s>[INST][/INST]{assistant message content}</s>` to the context? This would remove the need for the flattening, at least. While this will definitely impact the performance of models that were finetuned on this format to make them chat models, it would likely fix the implementation issues for now.
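As a sketch of that reading (hypothetical code, not the actual backend; it just makes the pairing rule from the question above explicit):

```python
def format_llama2_pairs(messages: list[dict]) -> str:
    """Render messages as Llama2 pairs, tolerating missing halves."""
    parts = []
    i = 0
    while i < len(messages):
        if messages[i]["role"] == "user":
            user = messages[i]["content"]
            if i + 1 < len(messages) and messages[i + 1]["role"] == "assistant":
                # Complete user+assistant pair.
                parts.append(f"<s>[INST]{user}[/INST]{messages[i + 1]['content']}</s>")
                i += 2
            else:
                # User message without an assistant reply.
                parts.append(f"<s>[INST]{user}[/INST]</s>")
                i += 1
        else:
            # Assistant message without a preceding user message.
            parts.append(f"<s>[INST][/INST]{messages[i]['content']}</s>")
            i += 1
    return "".join(parts)
```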

I have greater concerns with the second fix - I'll just put them down briefly here, to not diverge too far from the issue topic.

> make all templates agnostic as to turn taking.

In my view, this would remove the essential part of the templates and the formats. As I see it, foundation models are finetuned on turn-taking formats, to follow turn-taking formats; that is what makes them chat models. I admit this is based on my assumptions about transformer LLMs in general, but I doubt this is the place to discuss the underlying theory.

> This leaves it up to the game designer to decide whether it is worth the risk of putting some models at a disadvantage by producing contexts that may not look exactly like their training data.

This might force the game designer to be aware of and handle the specifics of any model they want to test (and, in the extreme, of all future models that may be tested with the game). I think this could lead to individual games being tailored to specific models, if only because of the additional burden of learning the specifics of a growing number of models. I'd personally prefer a benchmark not to have what looks like preferences to me.

davidschlangen commented 10 months ago

Can you post the Jinja template here that the HuggingFace llamas provide? Then I can explain.

Gnurro commented 10 months ago

```jinja
{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}
```

This is the template for the Meta Llama2 chat models. It's stored in the `tokenizer_config.json` of each tokenizer that has a predefined chat template.
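For reference, this is how such a stored template gets applied (assuming the `transformers` library and access to the gated `meta-llama/Llama-2-7b-chat-hf` checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
    {"role": "user", "content": "Let's play a game."},
]

# Renders the template above into the prompt string.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# A list violating the alternation (e.g. starting with an assistant
# message) hits the raise_exception branch and raises a template error.
```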

davidschlangen commented 10 months ago

Alright, so it’s probably just the part where the exception is raised.

The idea is to change the template so that it gracefully accepts any kind of list of role / message combinations. If it is as expected (user/assistant/user), it produces the same output as the template above. If it is not (e.g., assistant/user/user/assistant/god/assistant/user), it produces the closest approximation.

The point precisely is that the games do not need to know anything about the peculiarities of the models; including about constraints that they put up. The only fairness that we are committed to is that a human player would understand what’s going on.

(That said, the default assumption when setting up a game should still be that it provides a context to the player that amounts to a normal dialogue. But if a game has reasons not to do that, this shouldn’t crash.)
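As a sketch of what such a lenient template could look like (my own construction, not an actual fix; it drops the alternation check and, for brevity, the `<<SYS>>` handling, and renders each message on its own so any role sequence is accepted). It could be set via the tokenizer's `chat_template` attribute:

```python
# Hypothetical lenient variant of the Llama2 template: no exception,
# each message rendered independently, unknown roles passed through bare.
LENIENT_LLAMA2_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}"
    "{{ bos_token + '[INST] ' + message['content'].strip() + ' [/INST]' }}"
    "{% elif message['role'] == 'assistant' %}"
    "{{ ' ' + message['content'].strip() + ' ' + eos_token }}"
    "{% else %}"
    "{{ message['content'].strip() }}"  # system and other roles, unfenced
    "{% endif %}"
    "{% endfor %}"
)

# tokenizer.chat_template = LENIENT_LLAMA2_TEMPLATE  # overrides the stored template
```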

But no worries, for now it’s too late and all remaining models should be run in the same way (and probably fare worse on some games than they would have to). This can be tackled with the next version.

briemadu commented 8 months ago

Fixed privateshared by adding an "ok" turn from the assistant before the first round of probing in https://github.com/clp-research/clembench/commit/aabaecdf10ce4e9a4e43b4b54b413380ff812c7e .
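The shape of that fix, roughly (a hypothetical illustration; see the linked commit for the actual change):

```python
# Insert a minimal assistant turn so that the history alternates
# user/assistant before the probing questions are appended:
messages.append({"role": "assistant", "content": "ok"})
```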

briemadu commented 8 months ago

> This comes up with other games as well.

I will close the issue since it was about privateshared. If other games still have to handle this problem, please specify.