abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License
8.08k stars 961 forks

implement prompt template for chat completion #717

Open ehartford opened 1 year ago

ehartford commented 1 year ago

Is your feature request related to a problem? Please describe.

When generating chat completions, it is hard-coded to generate a non-standard prompt template that looks something like:

### User: <blabla>
### Assistant: <blabla>

The system message is currently ignored.

https://github.com/abetlen/llama-cpp-python/blob/255d653ae3bd08c690dbf01f9533daf75f71217c/llama_cpp/llama.py#L1578

This works for most models, but it's not correct.

Describe the solution you'd like

  1. add a set of built-in prompt templates user can specify at inference time ["vicuna","alpaca","chatml","llama2-chat","oasst"] at minimum
  2. recommend copying design from ooba's instruction templates or fastchat's conversation
  3. add ability to pass a template string for other nonstandard formats (such as the one currently implemented in llama-cpp-python).

Describe alternatives you've considered

Modifying llama-cpp-python to hard-code the llama2-chat format; not a great solution.

Additional context

ehartford commented 1 year ago

This is a hefty task, with architecture / design elements needed to do it in a clean way. I'm too busy to take it on myself right now, but in a couple of weeks I can try if nobody else has done it.

abetlen commented 1 year ago

Hey @ehartford this is actually something I've had in the backlog and just started last night in #711

My plan is to have those format identifiers and also provide a generic class (?) that users can extend to provide a custom chat template. The challenge is that it's not just the prompt that has to be modified but also stop sequences, grammar (in the case of open ai style function calling chats), and a few more things I probably haven't thought about but I think this is do-able.
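
A rough sketch of what such an extensible formatter could look like (names and the Alpaca-style format here are illustrative, not the merged API):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ChatFormatterResponse:
    prompt: str
    stop: Optional[List[str]] = None  # per-format stop sequences travel with the prompt


class ChatFormatter:
    """Base class a user could subclass to supply a custom template."""

    def format(self, messages: List[Dict[str, str]]) -> ChatFormatterResponse:
        raise NotImplementedError


class AlpacaFormatter(ChatFormatter):
    def format(self, messages: List[Dict[str, str]]) -> ChatFormatterResponse:
        parts = []
        for m in messages:
            if m["role"] == "system":
                parts.append(m["content"] + "\n")
            elif m["role"] == "user":
                parts.append("### Instruction:\n" + m["content"] + "\n")
            elif m["role"] == "assistant":
                parts.append("### Response:\n" + m["content"] + "\n")
        parts.append("### Response:\n")  # leave the final turn open for the model
        return ChatFormatterResponse(prompt="\n".join(parts), stop=["### Instruction:"])
```

Bundling the stop sequences into the response object is what lets each format carry its own termination rules.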

Thank you for the resources btw!

ehartford commented 1 year ago

Awesome, thanks! I think it can be done without requiring the user to write any code, using a clever template system, as is implemented by ooba and fastchat.

lexin4ever commented 1 year ago

As a workaround, I use the /v1/completions method (the create_completion function in llama.py instead of create_chat_completion), which allows me to set up any prompt. I can format the messages however I want and pass the result as the string prompt param.
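
A minimal sketch of that workaround, assuming a Vicuna-style format (the helper name is made up; the actual llama-cpp-python call is commented out since it needs a local model file):

```python
from typing import Dict, List


def to_vicuna_prompt(messages: List[Dict[str, str]]) -> str:
    """Flatten OpenAI-style chat messages into a hand-rolled Vicuna-style prompt."""
    lines = []
    for m in messages:
        if m["role"] == "system":
            lines.append(m["content"])
        elif m["role"] == "user":
            lines.append("USER: " + m["content"])
        elif m["role"] == "assistant":
            lines.append("ASSISTANT: " + m["content"])
    lines.append("ASSISTANT:")  # leave the last turn open for the model
    return "\n".join(lines)


# The string can then go straight to the completions endpoint, e.g.:
# llm = llama_cpp.Llama(model_path="./model.gguf")
# llm.create_completion(prompt=to_vicuna_prompt(messages), stop=["USER:"])
```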

ehartford commented 1 year ago

true, but that would require rewriting chatbot-ui to use /completions instead of /chat/completions (or using a reverse proxy to do that)

ehartford commented 1 year ago

Then my solution is to make a proxy that receives calls to /chat/completions and rewrites them into calls to llama-cpp-python's /completions endpoint, in order to inject the proper prompt format.
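
The core of such a proxy is just the payload rewrite; a sketch of that step, assuming a llama2-chat style target format (function name and format details are illustrative):

```python
from typing import Any, Dict


def chat_to_completion_payload(chat_req: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite an OpenAI-style /chat/completions body into a /completions body,
    injecting a llama-2 style prompt."""
    prompt = ""
    sys_msgs = [m for m in chat_req["messages"] if m["role"] == "system"]
    if sys_msgs:
        prompt += "<<SYS>>\n" + sys_msgs[0]["content"] + "\n<</SYS>>\n\n"
    for m in chat_req["messages"]:
        if m["role"] == "user":
            prompt += "[INST] " + m["content"] + " [/INST]\n"
        elif m["role"] == "assistant":
            prompt += m["content"] + "\n"
    # Pass every other field (model, temperature, ...) through untouched.
    out = {k: v for k, v in chat_req.items() if k != "messages"}
    out["prompt"] = prompt
    out.setdefault("stop", ["[INST]"])
    return out
```

The HTTP plumbing around it (receive on /chat/completions, forward to /completions) can be any reverse-proxy framework.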

teleprint-me commented 1 year ago

I just ran through some rough drafts with GPT.


Proposal for Advanced Customizable Prompts in Chat Completions

Problem Statement

The existing implementation for chat completions uses hard-coded prompts, constraining customization and flexibility. This limitation becomes evident when adapting the code for specific projects or applications that require unique prompt styles or formats.

    PROMPT = chat_history + "### Assistant:"
    PROMPT_STOP = ["### Assistant:", "### Human:"]

Proposed Solution

I propose adding two new optional parameters, prompt and prompt_stop, to the create_chat_completion method. These will allow users to specify custom prompt and stop token sequences.

    def create_chat_completion(
        # ...existing parameters...
        prompt: Optional[str] = None,
        prompt_stop: Optional[List[str]] = None,
    ):
        # ...
        PROMPT = chat_history + (prompt if prompt else "### Assistant:")
        PROMPT_STOP = prompt_stop if prompt_stop else ["### Assistant:", "### Human:"]
        # ...

Benefits

  1. Enhanced Flexibility: These changes offer users a high level of customization for prompt structures.
  2. Wider Applicability: Extending prompt customization increases the adaptability of the code to different use-cases.
  3. Ease of Use: The optional parameters maintain the cleanliness of the API while offering more freedom to users.

Backward Compatibility

The proposal maintains backward compatibility since both new parameters are optional and will use existing hard-coded values as defaults.

Suggested Defaults

In the absence of custom prompts, the system could fall back to prompts styled after Llama-2's structure as a sane default:

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """You are a helpful assistant."""

Practical Examples

    my_custom_prompt = ">>> Custom Assistant:"
    my_custom_stop = [">>> Custom Assistant:", ">>> Custom User:"]
    create_chat_completion(
        messages=...,
        prompt=my_custom_prompt,
        prompt_stop=my_custom_stop,
        # ...other parameters...
    )

Related Works

This proposal aims to integrate well with ongoing work in the configurable-chat-templates branch and issues like #711, focusing on handling bos and eos tokens.


I'm not sure if this fits well with what you guys had in mind. Let me know either way. I had the same idea though.

teleprint-me commented 1 year ago

I was looking at Open Interpreter and the source code was using litellm. So I figured I'd take a peek at it and vllm.

I checked out the docs for litellm templates and they have a fairly nice structure for prefixing and postfixing.

# Create your own custom prompt template (example adapted from the litellm docs)
litellm.register_prompt_template(
    model="togethercomputer/LLaMA-2-7B-32K",
    roles={
        "system": {
            "pre_message": "[INST] <<SYS>>\n",
            "post_message": "\n<</SYS>>\n [/INST]\n"
        },
        "user": {
            "pre_message": "[INST] ",
            "post_message": " [/INST]\n"
        },
        "assistant": {
            "post_message": "\n"
        }
    }
)

def test_huggingface_custom_model():
    # assumes `completion` is imported from litellm and `messages` is defined
    model = "huggingface/togethercomputer/LLaMA-2-7B-32K"
    response = completion(model=model, messages=messages, api_base="https://ecd4sb5n09bo4ei2.us-east-1.aws.endpoints.huggingface.cloud")
    print(response['choices'][0]['message']['content'])
    return response

test_huggingface_custom_model()

Found it pretty interesting because you can feed in the structure as a dict and then grab the values by the keys.
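
The core mechanism is simple enough to sketch in a few lines (my own illustration of the idea, not litellm's internals):

```python
from typing import Dict, List


def apply_role_template(
    messages: List[Dict[str, str]],
    roles: Dict[str, Dict[str, str]],
) -> str:
    """Wrap each message's content in its role's pre/post strings,
    in the spirit of litellm's register_prompt_template."""
    prompt = ""
    for m in messages:
        fmt = roles.get(m["role"], {})
        prompt += fmt.get("pre_message", "") + m["content"] + fmt.get("post_message", "")
    return prompt
```

Missing keys simply contribute nothing, which is why the assistant role can get away with only a post_message.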

I ran through it with GPT again and this is what it came up with as a proof-of-concept.


Revised Proposal for Role-Based Customizable Prompts in Chat Completions

Problem Statement

The current chat completions implementation relies on hard-coded prompts, limiting customization and flexibility. This is a bottleneck when adapting the code to specialized projects requiring unique role-based prompt styles or formats.

Proposed Solution

Replace the existing prompt and prompt_stop with a single role_templates parameter on the create_chat_completion method. This will offer users the capability to specify custom role-based formatting for different parts of the conversation.

def create_chat_completion(
    # ...existing parameters...
    role_templates: Optional[Dict[str, Dict[str, str]]] = None
):
    # ...existing code...

Benefits

  1. Enhanced Flexibility: Users gain high levels of customization for role-based prompt structures.
  2. Wider Applicability: The new architecture is adaptable to various chat roles and use-cases.
  3. Ease of Use: Replacing multiple parameters with a single role_templates parameter streamlines the API.

Backward Compatibility

This change maintains backward compatibility since the role_templates parameter is optional and defaults to existing hard-coded values if not provided.

Suggested Defaults

A reasonable default could mirror Llama-2's prompt structure:

DEFAULT_ROLE_TEMPLATES = {
    "system": {
        "pre_message": "[INST] <<SYS>>\n",
        "post_message": "\n<</SYS>>\n [/INST]\n"
    },
    "user": {
        "pre_message": "[INST]",
        "post_message": " [/INST]\n"
    },
    "assistant": {
        "post_message": "\n"
    }
}

Related Works


After reviewing the code, I know it won't be that simple.

Just wanted to share. Maybe it would inspire something.

NightMachinery commented 1 year ago

What models actually use the current chat prompt template?

It seems most models use Alpaca's format:

<System prompt/Character Card>

### Instruction:
Your instruction or question here.
For roleplay purposes, I suggest the following - Write <CHAR NAME>'s next reply in a chat between <YOUR NAME> and <CHAR NAME>. Write a single reply only.

### Response:
teleprint-me commented 1 year ago

@NightMachinery

It depends on the dataset and how it's trained and/or finetuned.

The format varies from model to model, but the 2 most popular formats are usually

"### Instruction:" and "### Assistant:"

and

"### Human:" and "Assistant:"

Sometimes it's "### Human:" and "### Bot:"

Open Assistant uses a mixture depending on version and dataset, "prompter:", or "human:", and "assistant:".

Some models are more complex than others, e.g. it's system prompt, input, instruction, and then response.

There's no fixed, or commonly accepted, format yet as far as I can tell.

Most chat models follow system, user, assistant, or some variation. Whether there are tokens that are used to denote which is which depends.

ehartford commented 1 year ago

The closest thing to standard is ChatML. And it's not widely accepted.

I've adopted it, and open assistant has adopted it. Vicuna and wizardLM haven't.

Hopefully a consensus emerges in the next year.

teleprint-me commented 1 year ago

@ehartford

I'm for ChatML.

The high-level interface is intuitive and easy to reason about and follow.

The low-level interface is similar to what Meta did with Llama-2's chat interface.

The tokens could probably be simplified though. Maybe the use of something more like markup would be an improvement?

<system>System prompt goes here</system>
<user>User prompt goes here</user>
<assistant>

And just "teach" the model that </tag> is always the stop token for that token sequence.

Then the output could be parsed similarly to XML/HTML.
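
For instance, a minimal parser for that kind of output could be (a sketch of the idea, assuming well-formed tags):

```python
import re
from typing import List, Tuple


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Extract (role, content) pairs from markup-style chat output.
    The tag set is whatever the template defines; </tag> doubles as
    the stop sequence for that turn."""
    return re.findall(r"<(\w+)>(.*?)</\1>", text, flags=re.DOTALL)
```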

I'm still learning, so just take what I'm saying with a friendly grain of salt.

This is something I plan on experimenting with if I get the opportunity to do it in the future.

I agree though, a consensus would be nice.

ehartford commented 1 year ago

What models actually use the current chat prompt template?

It seems most models use Alpaca's format:

<System prompt/Character Card>

### Instruction:
Your instruction or question here.
For roleplay purposes, I suggest the following - Write <CHAR NAME>'s next reply in a chat between <YOUR NAME> and <CHAR NAME>. Write a single reply only.

### Response:

The current scheme implemented in llama-cpp-python doesn't follow a convention I know of.

Please see the links in my original issue for a comprehensive and detailed list of the currently popular prompt templates.

90%+ of use cases will be covered if the following formats are supported:

The best source of documentation on these prompt formats is probably the model cards in TheBloke's distributions which are very well researched.

abetlen commented 1 year ago

Hey @ehartford I just merged in the #711 PR which adds a mechanism to specify common chat formats through a chat_format parameter on the Llama class and the server settings.

Currently supports:

Let me know if that works for you!
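
For anyone landing here, usage looks roughly like this (the model path is hypothetical, and the calls are commented out because they need a local GGUF file):

```python
# Messages use the OpenAI-style role/content schema.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# from llama_cpp import Llama
# llm = Llama(model_path="./model.gguf", chat_format="llama-2")
# print(llm.create_chat_completion(messages=messages)["choices"][0]["message"]["content"])

# The server exposes the same setting:
#   python3 -m llama_cpp.server --model ./model.gguf --chat_format chatml
```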

ehartford commented 1 year ago

Nice!

ehartford commented 1 year ago

ChatML would be lovely; it's garnering more support.

r7l commented 1 year ago

I've noted this as well and it's great to see support being added just now. But looking at the code, it seems as if there is room to be a bit more flexible and customizable.

For example, in LocalAI they allow people to add Yaml files with a configuration preset for each model. I really like their idea in general.

Maybe it would be an option for the future to have something similar. Instead of having everything fixed in the code, allow people to add a YAML file option and pass the content of the file into a format_from_yaml-like function.

NightMachinery commented 1 year ago

I've noted this as well and it's great to see support being added just now. But looking at the code, it seems as if there is room to be a bit more flexible and customizable.

For example, in LocalAI they allow people to add Yaml files with a configuration preset for each model. I really like their idea in general.

Maybe it would be an option for the future to have something similar. Instead of having everything fixed in the code, allow people to add a YAML file option and pass the content of the file into a format_from_yaml-like function.

Or better yet, accept a general lambda as an argument and implement the YAML idea as a specific lambda that can take a YAML file and template the response. E.g.,

chat_template_fn=partial(yaml_template_reader, yaml_path="...")
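
A sketch of what that lambda could do once the YAML preset has been parsed (the function name and preset keys are made up; the preset dict stands in for the output of yaml.safe_load on the file):

```python
from functools import partial
from typing import Dict, List


def template_from_config(config: Dict[str, str], messages: List[Dict[str, str]]) -> str:
    """Render messages using a per-model preset, e.g. one loaded from YAML."""
    out = []
    for m in messages:
        out.append(config[m["role"]].format(content=m["content"]))
    out.append(config.get("suffix", ""))
    return "".join(out)


preset = {  # what a per-model YAML preset might contain once parsed
    "system": "<<SYS>>{content}<</SYS>>\n",
    "user": "[INST] {content} [/INST]\n",
    "assistant": "{content}\n",
    "suffix": "",
}
chat_template_fn = partial(template_from_config, preset)
```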
teleprint-me commented 1 year ago

Anything besides yaml, please 🙏. Simple is always better.

abetlen commented 1 year ago

@ehartford I'll add that and a few others I missed (mistral as well).

@r7l I'll consider this but likely as a utility for the server that converts a config file / template into a chat formatting function.

bioshazard commented 1 year ago

Thank you for your work on chat templates and llama-cpp-python generally!!

Curious if you could just universally piggyback on the HuggingFace template hub, or let users specify a tokenizer_config.json, to completely outsource it to this developing standard of rendering arbitrary Jinja provided with the model? I would be surprised if new model releases don't all start coming with their own tokenizer config definition.

EDIT: Delivered the above in linked PR

LynxPDA commented 1 year ago

@abetlen

Generation does not stop when using the ChatML prompt template. I think we need to add a stop token, stop_str = "<|im_end|>", and return ChatFormatterResponse(prompt=_prompt, stop=stop_str).

It worked for me.

@register_chat_format("chatml")
def format_chatml(
    messages: List[llama_types.ChatCompletionRequestMessage],
    **kwargs: Any,
) -> ChatFormatterResponse:
    system_template = """<|im_start|>system
{system_message}"""
    system_message = _get_system_message(messages)
    system_message = system_template.format(system_message=system_message)
    _roles = dict(user="<|im_start|>user", assistant="<|im_start|>assistant")
    _sep = "<|im_end|>"
    stop_str = "<|im_end|>"
    _messages = _map_roles(messages, _roles)
    _messages.append((_roles["assistant"], None))
    _prompt = _format_chatml(system_message, _messages, _sep)
    return ChatFormatterResponse(prompt=_prompt, stop=stop_str)
bioshazard commented 1 year ago

A default stop token would be huge for realizing a transparent model provider.

teleprint-me commented 1 year ago

There's a problem with using the stop tokens.

I'm not sure what the difference is yet, but I noticed that using the special tokens in the user facing templates causes a lot of issues.

I would advise not using special tokens at all with llama.cpp. In almost every test I conducted, the models started repeating themselves, derailing, and more.

Using the base template seems to work beautifully though. Not a single issue once I do that.

bioshazard commented 1 year ago

So you'd exclude them from the template, but still set the token as a default stop sequence item, right? That saves you from having to specify it in the payload, and it will be needed to fully decouple the model from the chat consumer (other than maybe max tokens in the payload).

teleprint-me commented 1 year ago

I noted it in my PR on L14.


IMPORTANT NOTES:

Example using the llama-2 model and its templating schema:

  1  <<SYS>>My name is Llama and I am a helpful assistant.<</SYS>>$
  2  [INST] Hello Llama, my name is User. What's your name? [/INST]$
  3  Hello User, my name is Llama. Nice to meet you!$
  4  [INST] What can you do? [/INST]$
  5  I can assist you with various tasks, including providing structured output for certain queries.$
  6  [INST] How can you assist me in my programming projects? [/INST]$
  7  $

This initial example is a proper template format that the model understands. It results in proper output and does not confuse the model.

  1  <<SYS>>My name is Llama and I am a helpful assistant.<</SYS>>$
  2  <s>[INST] Hello Llama, my name is User. What's your name? [/INST]$
  3  Hello User, my name is Llama. Nice to meet you!</s>$
  4  <s>[INST] What can you do? [/INST]$
  5  I can assist you with various tasks, including providing structured output for certain queries.</s>$
  6  <s>[INST] How can you assist me in my programming projects? [/INST]$
  7  $

This example includes the use of special tokens, and the model may or may not use these tokens as a result. The model is not expecting them during inference, which causes unexpected behavior.

  1  <<SYS>>My name is Llama and I am a helpful assistant.<</SYS>>$
  2  $
  3  <s>[INST] Hello Llama, my name is User. What's your name? [/INST]$
  4  Hello User, my name is Llama. Nice to meet you!</s>$
  5  $
  6  <s>[INST] What can you do? [/INST]$
  7  I can assist you with various tasks, including providing structured output for certain queries.</s>$
  8  $
  9  <s>[INST] How can you assist me in my programming projects? [/INST]$
 10  $

This example is improperly formatted and causes the model to become confused. The model begins to fixate on tokens, uses language repetition, and eventually derails.


Note that the $ symbols are substitutes for newline characters (\n); they're part of the output of cat -A.

Mwni commented 1 year ago

I propose we use a dedicated library for this: chatformat. The functionality to format chat prompts is not specific to this project. Creating a shared library will help other developers, but also help us by attracting contributions from outside. It's a win-win.

Additionally, what I'm missing with the current implementation is the possibility to "preface" the model's output. Prefacing means "putting words into the model's mouth".

The issue is that the current implementation seals off the last incomplete message:

USER: How do you feel?
ASSISTANT: I feel </s>

The </s> should not be there. Chatformat leaves the last assistant message open by default.
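
The behavior is easy to express: only closed assistant turns get the stop token. A sketch of my own, illustrating the idea rather than chatformat's actual API:

```python
from typing import Dict, List


def render_vicuna(messages: List[Dict[str, str]]) -> str:
    """Render Vicuna-style, leaving a trailing assistant message open
    ("prefacing") instead of sealing it with </s>."""
    out = []
    for i, m in enumerate(messages):
        last = i == len(messages) - 1
        if m["role"] == "user":
            out.append("USER: " + m["content"])
        elif m["role"] == "assistant":
            line = "ASSISTANT: " + m["content"]
            if not last:
                line += "</s>"  # only closed turns get the stop token
            out.append(line)
    return "\n".join(out)
```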

earonesty commented 1 year ago

We should probably use jinja templates, so the user can specify them at runtime if needed. Engines won't know which template to use; users may have models they just finished fine-tuning, with custom grammars, etc.

this will get everyone what they want.

Mwni commented 1 year ago

@earonesty Good point about having custom templates. But I think using a templating engine is overcomplicating the matter. These chat formats generally consist of "rounds" that are stacked together.

A round is defined as

  1. A system message (optional)
  2. A user prompt
  3. The model response

We can cover 99% of all possible formats by

So for example for Alpaca, the format can be defined as:

alpaca:
  with_system: |-
    {system}

    ### Instruction:
    {user}

    ### Response:
    {assistant}</s>

  without_system: |-
    ### Instruction:
    {user}

    ### Response:
    {assistant}</s>

  round_seperator: "\n\n"

If you know of a format that is not covered by this convention, please comment.
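
A renderer for this round-based scheme fits in a few lines (my own sketch of the proposal; the key spellings, including round_seperator, follow the YAML above):

```python
from typing import Dict, List


def render_rounds(fmt: Dict[str, str], rounds: List[Dict[str, str]]) -> str:
    """Stack chat 'rounds' together using with_system / without_system
    templates, joined by round_seperator (key spelled as in the proposal)."""
    parts = []
    for r in rounds:
        template = fmt["with_system"] if "system" in r else fmt["without_system"]
        parts.append(template.format(
            system=r.get("system", ""),
            user=r["user"],
            assistant=r.get("assistant", ""),  # open rounds render an empty response
        ))
    return fmt["round_seperator"].join(parts)


ALPACA = {
    "with_system": "{system}\n\n### Instruction:\n{user}\n\n### Response:\n{assistant}</s>",
    "without_system": "### Instruction:\n{user}\n\n### Response:\n{assistant}</s>",
    "round_seperator": "\n\n",
}
```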

teleprint-me commented 1 year ago

@Mwni

If you know of a format that is not covered by this convention, please comment.

Llama-1, Llama-2, RedPajama, Mistral, Refact, etc...

earonesty commented 1 year ago

jinja2 is incredibly easy and lightweight

and it's 100% compatible with what all current researchers are producing in their tokenizer configuration files


Mwni commented 1 year ago

@teleprint-me The following prompts were generated using the proposed scheme

Llama-2

<s>[INST] <<SYS>>
You are a very clever LLM.
<</SYS>>

Hello? [/INST] Hello.</s><s>[INST] What are you thinking? [/INST] I think that

Vicuna (and Mistral)

You are a very clever LLM.

USER: Hello?
ASSISTANT: Hello.</s>
USER: What are you thinking?
ASSISTANT: I think that

ChatML

<|im_start|>system
You are a very clever LLM.<|im_end|>
<|im_start|>user
Hello?<|im_end|>
<|im_start|>assistant
Hello.<|im_end|>
<|im_start|>user
What are you thinking?<|im_end|>
<|im_start|>assistant
I think that

Where's the problem?

earonesty commented 1 year ago

The problem is that jinja2 is what is sitting in hf config files, so it's forward compatible with stuff you haven't heard of, and it can be sucked into gguf file metadata so that the user isn't on the hook to specify a template when working with gguf files. It has a forward compatibility path that matters.

bioshazard commented 1 year ago

Having it load right from the meta data would be killer.

teleprint-me commented 1 year ago

@earonesty

Where in the specification is that? Also, ggerganov already stated he plans on using oblique templates and it will be a minimal, separate, implementation.

earonesty commented 1 year ago

gguf allows you to store any metadata you want, and models on hf have jinja2 templates in their tokenizer configs, so it doesn't really matter about the specification that much. We can just add it to the convert script.

madprops commented 5 months ago

So can I define a custom format like:

"<|start_header_id|>{name}<|end_header_id|>\n\n"

How? So far I have a dropdown box that selects the pre-defined formats.