InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] Anthropic-style OpenAI-API extension for specifying prefix of assistant's response #2161

Closed josephrocca closed 2 months ago

josephrocca commented 2 months ago

Motivation

It's often useful to specify a prefix for an LLM's response to help get it on the right track (e.g. specifying the prefix as { to start JSON output so the model doesn't reply "Sure, I can generate some JSON for you..."). It's a simple, lightweight form of output constraint.

Anthropic gives this example in their docs:

[
  {"role": "user", "content": "What's the Greek name for Sun? (A) Sol (B) Helios (C) Sun"},
  {"role": "assistant", "content": "The best answer is ("},
]

The context for this feature request is that I was using Sonnet 3.5 for image captioning, and then wanted to try OpenGVLab/InternVL2-Llama3-76B-AWQ with LMDeploy. While prompt engineering with Sonnet 3.5, I found that I achieved significantly higher success rates by using the assistant response prefix feature. Upon testing with LMDeploy, this doesn't seem to be supported, and since InternVL2 has similar failure modes to Sonnet 3.5, I think it would also benefit significantly from the response prefix constraint.

I think that I will have to find the correct chat template and manually construct the text for now, but it would be great if this were considered as a feature for the next version of LMDeploy.

Additional context

Note that the response for this:

curl https://api.anthropic.com/v1/messages \
     --header "x-api-key: your-api-key-here" \
     --header "anthropic-version: 2023-06-01" \
     --header "content-type: application/json" \
     --data '{"model":"claude-3-5-sonnet-20240620", "max_tokens":1024, "messages":[{"role":"user", "content":"Give a short answer without explanation. What is the Greek name for Sun? (A) Sol (B) Helios (C) Sun"}, {"role": "assistant", "content": "The best answer is ("}]}'

is this:

{"role":"assistant","model":"claude-3-5-sonnet-20240620","content":[{"type":"text","text":"B) Helios."}],"stop_reason":"end_turn","stop_sequence":null,"usage":{"input_tokens":41,"output_tokens":9}}

So the response only includes the content after the specified prefix - i.e. it does not return the given prefix as part of the response.
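For what it's worth, the caller then just concatenates the prefix with the returned text; a trivial sketch using the values from the response above:

# The API returns only the text generated after the prefix, so the client
# rebuilds the full assistant message by concatenating prefix + completion.
prefix = "The best answer is ("
completion = "B) Helios."  # content[0]["text"] from the response above
print(prefix + completion)  # The best answer is (B) Helios.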

josephrocca commented 2 months ago

Note that OpenRouter also supports specifying a prefix for the assistant response via the same method, so I think that this feature is standardized enough that it's safe to add it.

Assistant Prefill: OpenRouter supports asking models to complete a partial response. This can be useful for guiding models to respond in a certain way. To use this feature, simply include a message with role: "assistant" at the end of your messages array. Example:

await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${OPENROUTER_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    "model": "mistralai/mixtral-8x7b-instruct",
    "messages": [
      {"role": "user", "content": "Who are you?"},
      {"role": "assistant", "content": "I'm not sure, but my best guess is"},
    ],
  })
});
josephrocca commented 2 months ago

I will have to find the correct chat template and manually construct the text for now

Hmm, it seems that this is a little more complex with a VLM than with an LLM. If anyone has any tips on the easiest / most minimal change to the API server required to achieve something like this, that would be great! Maybe a hacky approach that I could try is to just remove the last N tokens right before starting generation, where N corresponds to the number of special tokens in the template that come after a message is finished and before the start of the next message. A rough sketch of this idea is below.
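For the plain-LLM case, here is a rough, untested sketch of that idea, using the chat-template helpers shown further down in this thread: build the decorated prompt, strip the tokens the template appends after the partial assistant message, and send the raw text to the OpenAI-compatible completions endpoint so no further template decoration happens. The chat template name, port, and model name are placeholders.

from lmdeploy.model import MODELS
import requests

# Build the decorated prompt with the model's chat template (name is an assumption).
chat_template = MODELS.get("internlm2")()
messages = [
    {"role": "user", "content": "What's the Greek name for Sun? (A) Sol (B) Helios (C) Sun"},
    {"role": "assistant", "content": "The best answer is ("},
]
prompt = chat_template.messages2prompt(messages)

# Strip what the template appends after the partial assistant message:
# its end-of-assistant token, the separator, and the re-opened assistant turn.
tail = f"{chat_template.eoa}{chat_template.separator}{chat_template.assistant}"
if prompt.endswith(tail):
    prompt = prompt[:-len(tail)]

# Send the raw prompt to the completions endpoint of a running api_server
# (default port assumed), so the server does not re-apply the chat template.
resp = requests.post(
    "http://localhost:23333/v1/completions",
    json={"model": "internlm2", "prompt": prompt, "max_tokens": 64},
)
print(resp.json())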

lvhan028 commented 2 months ago

Hi, @josephrocca In LMDeploy, we decorate the multi-round messages using a model's chat template as follows:

def messages2prompt(self, messages, sequence_start=True, **kwargs):
    """Return the prompt that is concatenated with other elements in the
    chat template.

    Args:
        messages (str | List): user's input prompt
    Returns:
        str: the concatenated prompt
    """
    if isinstance(messages, str):
        return self.get_prompt(messages, sequence_start)
    box_map = dict(user=self.user,
                   assistant=self.assistant,
                   system=self.system)
    eox_map = dict(user=self.eoh,
                   assistant=self.eoa + self.separator,
                   system=self.eosys)
    ret = ''
    if self.meta_instruction is not None and sequence_start:
        if len(messages) and messages[0]['role'] != 'system':
            ret += f'{self.system}{self.meta_instruction}{self.eosys}'
    for message in messages:
        role = message['role']
        content = message['content']
        ret += f'{box_map[role]}{content}{eox_map[role]}'
    ret += f'{self.assistant}'
    return ret

As you can see, after the loop over the messages, we append {self.assistant}. That results in two consecutive "assistant" turns in the final prompt. I think that's the reason.

Here is the verifying code:

from lmdeploy.model import MODELS

# I use "internlm2" chat template as an example
chat_template = MODELS.get("internlm2")()

# the messages from your issue
messages = [
  {"role": "user", "content": "What's the Greek name for Sun? (A) Sol (B) Helios (C) Sun"},
  {"role": "assistant", "content": "The best answer is ("},
]

result = chat_template.messages2prompt(messages)
print(result)

The result is:

<|im_start|>system
You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
<|im_end|>
<|im_start|>user
What's the Greek name for Sun? (A) Sol (B) Helios (C) Sun<|im_end|>
<|im_start|>assistant
The best answer is (<|im_end|>
<|im_start|>assistant

As you can see, there are two consecutive "<|im_start|>assistant" markers in the prompt, which is invalid.
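For reference, here is a minimal sketch (not LMDeploy's actual implementation) of how messages2prompt could treat a trailing assistant message as a response prefix: leave the last assistant turn open instead of closing it and re-opening a new one.

def messages2prompt(self, messages, sequence_start=True, **kwargs):
    """Same as above, but a trailing assistant message is kept open so the
    model continues it instead of starting a new assistant turn."""
    if isinstance(messages, str):
        return self.get_prompt(messages, sequence_start)
    box_map = dict(user=self.user,
                   assistant=self.assistant,
                   system=self.system)
    eox_map = dict(user=self.eoh,
                   assistant=self.eoa + self.separator,
                   system=self.eosys)
    ret = ''
    if self.meta_instruction is not None and sequence_start:
        if len(messages) and messages[0]['role'] != 'system':
            ret += f'{self.system}{self.meta_instruction}{self.eosys}'
    for index, message in enumerate(messages):
        role = message['role']
        content = message['content']
        if index == len(messages) - 1 and role == 'assistant':
            # Emit the prefix without eoa and without re-appending
            # self.assistant, so generation continues the prefix.
            return ret + f'{box_map[role]}{content}'
        ret += f'{box_map[role]}{content}{eox_map[role]}'
    ret += f'{self.assistant}'
    return ret

With the messages above, this would yield a prompt ending in "<|im_start|>assistant" followed directly by "The best answer is (", so the model simply continues the prefix.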

lvhan028 commented 2 months ago

@AllentDan I think we can support specifying a response prefix.