Closed: josephrocca closed this issue 2 months ago
Note that OpenRouter also supports specifying a prefix for the assistant response via the same method, so I think that this feature is standardized enough that it's safe to add it.
Assistant Prefill: OpenRouter supports asking models to complete a partial response. This can be useful for guiding models to respond in a certain way. To use this feature, simply include a message with role: "assistant" at the end of your messages array. Example:
await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${OPENROUTER_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    "model": "mistralai/mixtral-8x7b-instruct",
    "messages": [
      {"role": "user", "content": "Who are you?"},
      {"role": "assistant", "content": "I'm not sure, but my best guess is"},
    ],
  })
});
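For reference, the same OpenRouter prefill request can be expressed with Python's `requests` library (a sketch; the endpoint and payload are taken from the JavaScript example above, and the API key is assumed to be in the environment):

```python
import json
import os

# import requests  # uncomment to actually send the request

# Same prefill request as the JavaScript example above: the trailing
# message with role "assistant" is the partial response to continue.
payload = {
    "model": "mistralai/mixtral-8x7b-instruct",
    "messages": [
        {"role": "user", "content": "Who are you?"},
        {"role": "assistant", "content": "I'm not sure, but my best guess is"},
    ],
}
headers = {
    "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}",
    "Content-Type": "application/json",
}
# resp = requests.post("https://openrouter.ai/api/v1/chat/completions",
#                      headers=headers, data=json.dumps(payload))
```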
I will have to find the correct chat template and manually construct the text for now.
Hmm, it seems that this is a little more complex with a VLM than with an LLM. If anyone has tips on the easiest / most minimal change to the API server required to achieve something like this, that would be great! One hacky approach I could try is to simply remove the last N tokens right before starting generation, where N is the number of special tokens the template inserts after a finished message and before the start of the next one.
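A string-level variant of that hack could be sketched as follows (a hypothetical helper, not part of any server; `chat_template` is assumed to render a message list to a prompt string, and the final assistant message's content is assumed to appear verbatim in the rendered prompt):

```python
def build_prefill_prompt(chat_template, messages):
    """Render the conversation, then truncate everything the template
    appended after the final assistant message's content (e.g. an
    end-of-turn token plus the next assistant header), so generation
    continues directly from the prefix."""
    prefix = messages[-1]["content"]
    rendered = chat_template.messages2prompt(messages)
    # Cut right after the last occurrence of the prefix text itself.
    cut = rendered.rindex(prefix) + len(prefix)
    return rendered[:cut]
```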
Hi, @josephrocca. In LMDeploy, we decorate multi-round messages using a model's chat template as follows:
def messages2prompt(self, messages, sequence_start=True, **kwargs):
    """Return the prompt that is concatenated with other elements in the
    chat template.

    Args:
        messages (str | List): user's input prompt
    Returns:
        str: the concatenated prompt
    """
    if isinstance(messages, str):
        return self.get_prompt(messages, sequence_start)
    box_map = dict(user=self.user,
                   assistant=self.assistant,
                   system=self.system)
    eox_map = dict(user=self.eoh,
                   assistant=self.eoa + self.separator,
                   system=self.eosys)
    ret = ''
    if self.meta_instruction is not None and sequence_start:
        if len(messages) and messages[0]['role'] != 'system':
            ret += f'{self.system}{self.meta_instruction}{self.eosys}'
    for message in messages:
        role = message['role']
        content = message['content']
        ret += f'{box_map[role]}{content}{eox_map[role]}'
    ret += f'{self.assistant}'
    return ret
As you can see, after the loop over messages we append {self.assistant}. When the messages already end with an assistant turn, this produces two consecutive "assistant" headers in the final prompt. I think that's the reason.
Here is the verifying code:
from lmdeploy.model import MODELS
# I use "internlm2" chat template as an example
chat_template = MODELS.get("internlm2")()
# the messages from your issue
messages = [
    {"role": "user", "content": "What's the Greek name for Sun? (A) Sol (B) Helios (C) Sun"},
    {"role": "assistant", "content": "The best answer is ("},
]
result = chat_template.messages2prompt(messages)
print(result)
The result is:
<|im_start|>system
You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
<|im_end|>
<|im_start|>user
What's the Greek name for Sun? (A) Sol (B) Helios (C) Sun<|im_end|>
<|im_start|>assistant
The best answer is (<|im_end|>
<|im_start|>assistant
See, there are two consecutive "<|im_start|>assistant" headers in the prompt, which is illegal.
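One minimal way to handle this (a sketch only, not an actual LMDeploy patch) would be to treat a trailing assistant message as a prefill: render all earlier messages as usual, so the prompt still ends with the assistant header, then append the prefix content instead of closing the turn:

```python
def messages2prompt_with_prefill(tmpl, messages):
    """Hypothetical wrapper around a chat template's messages2prompt that
    supports a response prefix via a trailing assistant message."""
    prefill = None
    if messages and messages[-1]["role"] == "assistant":
        prefill = messages[-1]["content"]
        messages = messages[:-1]
    # messages2prompt already ends the prompt with the assistant header
    ret = tmpl.messages2prompt(messages)
    if prefill is not None:
        ret += prefill  # generation continues right after the prefix
    return ret
```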
@AllentDan I think we can support the prefix-specifying feature.
Motivation
It's often useful to specify a prefix for an LLM's response to help get it on the right track (e.g. specifying the prefix as { to start JSON output, so the model doesn't reply "Sure, I can generate some JSON for you..."). It's a simple and easy form of output constraint. Anthropic gives this example in their docs:
The context for this feature request is that I was using Sonnet 3.5 for image captioning, and then wanted to try OpenGVLab/InternVL2-Llama3-76B-AWQ with LMDeploy. Through prompt engineering with Sonnet 3.5, I found that I achieved very significantly higher success rates by using the assistant response prefix feature. Upon testing with LMDeploy, this doesn't seem to be supported, and since InternVL2 has similar failure modes to Sonnet 3.5, I think it would also benefit significantly from the response prefix constraint. I think that I will have to find the correct chat template and manually construct the text for now, but it would be great if this were considered as a feature for the next version of LMDeploy.
Related resources
Additional context
Note that (per the request/response example above) the response only includes the content generated after the specified prefix, i.e. it does not return the given prefix as part of the response.
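Since the prefix is not echoed back, a client that wants the full assistant message has to concatenate the prefix with the returned completion itself. A trivial sketch (the completion string here is made up for illustration):

```python
# The API returns only the text generated after the prefix, so the
# client reassembles the full assistant message by concatenation.
prefix = "The best answer is ("
completion = "B) Helios"           # hypothetical content returned by the API
full_answer = prefix + completion
```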