ehartford opened 1 year ago
This is a hefty task, with architecture/design elements needed to do it in a clean way. I'm too busy to take it on myself right now, but in a couple of weeks I can try if nobody else has done it.
Hey @ehartford this is actually something I've had in the backlog and just started last night in #711
My plan is to have those format identifiers and also provide a generic class (?) that users can extend to provide a custom chat template. The challenge is that it's not just the prompt that has to be modified but also stop sequences, grammar (in the case of open ai style function calling chats), and a few more things I probably haven't thought about but I think this is do-able.
Thank you for the resources btw!
Awesome, thanks! I think it can be done without requiring the user to write any code, using a clever template system, as is implemented by ooba and fastchat.
As a workaround, I use the /v1/completions method (the create_completion function in llama.py instead of create_chat_completion), which lets me set up any prompt. I can format the messages however I want and pass them as a string via the prompt param.
True, but that would require rewriting chatbot-ui to use /completions instead of /chat/completions (or using a reverse proxy to do that).
My solution, then, is to make a proxy that receives calls to /chat/completions and rewrites them into calls to llama-cpp-python's /completions endpoint, in order to inject the proper prompt format.
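A minimal sketch of the rewrite such a proxy would perform. The "### Human:"/"### Assistant:" format and the function name are illustrative assumptions, not llama-cpp-python API:

```python
# Sketch: convert an OpenAI-style /chat/completions request body into
# a /completions body with an explicit prompt. The "### Human:" /
# "### Assistant:" format here is just an example; substitute the
# template your model expects.
from typing import Any, Dict

def chat_to_completion_body(body: Dict[str, Any]) -> Dict[str, Any]:
    parts = []
    for msg in body["messages"]:
        role = msg["role"]
        if role == "system":
            parts.append(msg["content"])
        elif role == "user":
            parts.append("### Human: " + msg["content"])
        else:
            parts.append("### Assistant: " + msg["content"])
    prompt = "\n".join(parts) + "\n### Assistant:"
    # Copy everything except "messages", then add prompt and stop.
    out = {k: v for k, v in body.items() if k != "messages"}
    out["prompt"] = prompt
    out.setdefault("stop", ["### Human:"])
    return out
```

A reverse proxy would apply this to each incoming request body and forward the result to /completions, leaving everything else unchanged.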
I just ran through some rough drafts with GPT.
The existing implementation for chat completions uses hard-coded prompts, constraining customization and flexibility. This limitation becomes evident when adapting the code for specific projects or applications that require unique prompt styles or formats.
PROMPT = chat_history + "### Assistant:"
PROMPT_STOP = ["### Assistant:", "### Human:"]
I propose two new optional parameters, prompt and prompt_stop, for the create_chat_completion method. These will allow users to specify custom prompt and stop token sequences.
def create_chat_completion(
# ...existing parameters...
prompt: Optional[str] = None,
prompt_stop: Optional[List[str]] = None,
):
# ...
PROMPT = chat_history + (prompt if prompt else "### Assistant:")
PROMPT_STOP = prompt_stop if prompt_stop else ["### Assistant:", "### Human:"]
# ...
The proposal maintains backward compatibility since both new parameters are optional and will use existing hard-coded values as defaults.
In absence of custom prompts, the system could default to prompts styled after Llama-2's structure as a sane default:
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """You are a helpful assistant."""
my_custom_prompt = ">>> Custom Assistant:"
my_custom_stop = [">>> Custom Assistant:", ">>> Custom User:"]
create_chat_completion(
messages=...,
prompt=my_custom_prompt,
prompt_stop=my_custom_stop,
# ...other parameters...
)
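The fallback logic could be sketched as a standalone function like the one below; build_prompt is a hypothetical name for illustration, not the actual llama.py code:

```python
from typing import List, Optional, Tuple

# Current hard-coded values, kept as defaults for backward compatibility.
DEFAULT_PROMPT = "### Assistant:"
DEFAULT_STOP = ["### Assistant:", "### Human:"]

def build_prompt(
    chat_history: str,
    prompt: Optional[str] = None,
    prompt_stop: Optional[List[str]] = None,
) -> Tuple[str, List[str]]:
    # Fall back to the existing hard-coded values when the caller
    # passes nothing, so current behavior is unchanged.
    final_prompt = chat_history + (prompt if prompt else DEFAULT_PROMPT)
    final_stop = prompt_stop if prompt_stop else DEFAULT_STOP
    return final_prompt, final_stop
```

Callers that pass nothing get today's behavior; callers that pass the custom values above get their own format.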
This proposal aims to integrate well with ongoing work in the configurable-chat-templates branch and issues like #711, focusing on handling bos and eos tokens.
I'm not sure if this fits well with what you guys had in mind. Let me know either way. I had the same idea though.
I was looking at Open Interpreter and noticed its source code uses litellm. So I figured I'd take a peek at it and at vllm.
I checked out the docs for litellm templates and they have a fairly nice structure for prefixing and postfixing.
# Create your own custom prompt template works
litellm.register_prompt_template(
model="togethercomputer/LLaMA-2-7B-32K",
roles={
"system": {
"pre_message": "[INST] <<SYS>>\n",
"post_message": "\n<</SYS>>\n [/INST]\n"
},
"user": {
"pre_message": "[INST] ",
"post_message": " [/INST]\n"
},
"assistant": {
"post_message": "\n"
}
}
)
def test_huggingface_custom_model():
model = "huggingface/togethercomputer/LLaMA-2-7B-32K"
response = completion(model=model, messages=messages, api_base="https://ecd4sb5n09bo4ei2.us-east-1.aws.endpoints.huggingface.cloud")
print(response['choices'][0]['message']['content'])
return response
test_huggingface_custom_model()
Found it pretty interesting because you can feed in the structure as a dict
and then grab the values by the keys.
I ran through it with GPT again and this is what it came up with as a proof-of-concept.
The current chat completions implementation relies on hard-coded prompts, limiting customization and flexibility. This is a bottleneck when adapting the code to specialized projects requiring unique role-based prompt styles or formats.
Replace the existing prompt and prompt_stop with a single role_templates parameter on the create_chat_completion method. This will let users specify custom role-based formatting for different parts of the conversation.
def create_chat_completion(
# ...existing parameters...
role_templates: Optional[Dict[str, Dict[str, str]]] = None
):
# ...existing code...
A single role_templates parameter streamlines the API. This change maintains backward compatibility since role_templates is optional and defaults to the existing hard-coded values if not provided.
A reasonable default could mirror Llama-2's prompt structure:
DEFAULT_ROLE_TEMPLATES = {
"system": {
"pre_message": "[INST] <<SYS>>\n",
"post_message": "\n<</SYS>>\n [/INST]\n"
},
"user": {
"pre_message": "[INST]",
"post_message": " [/INST]\n"
},
"assistant": {
"post_message": "\n"
}
}
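To illustrate how such a dict could drive formatting, here's a hedged sketch; the helper name apply_role_templates is mine, not part of the proposal or litellm:

```python
from typing import Dict, List

def apply_role_templates(
    messages: List[Dict[str, str]],
    role_templates: Dict[str, Dict[str, str]],
) -> str:
    # Wrap each message's content in its role's pre/post strings.
    # Roles or keys missing from the template default to empty strings.
    out = []
    for msg in messages:
        tmpl = role_templates.get(msg["role"], {})
        out.append(
            tmpl.get("pre_message", "")
            + msg["content"]
            + tmpl.get("post_message", "")
        )
    return "".join(out)
```

Because the lookup is by role name, any roles dict in the litellm style above can be passed straight through.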
After reviewing the code, I know it won't be that simple.
Just wanted to share. Maybe it would inspire something.
What models actually use the current chat prompt template?
It seems most models use Alpaca's format:
<System prompt/Character Card>
### Instruction:
Your instruction or question here.
For roleplay purposes, I suggest the following - Write <CHAR NAME>'s next reply in a chat between <YOUR NAME> and <CHAR NAME>. Write a single reply only.
### Response:
@NightMachinery
It depends on the dataset and how it's trained and/or finetuned.
The format varies from model to model, but the 2 most popular formats are usually "### Instruction:" and "### Assistant:", and "### Human:" and "Assistant:".
Sometimes it's "### Human:" and "### Bot:"
Open Assistant uses a mixture depending on version and dataset, "prompter:", or "human:", and "assistant:".
Some models are more complex than others, e.g. it's system prompt, input, instruction, and then response.
There's no fixed, or commonly accepted, format yet as far as I can tell.
Most chat models follow system, user, assistant, or some variation. Whether there are tokens that are used to denote which is which depends.
The closest thing to standard is ChatML. And it's not widely accepted.
I've adopted it, and open assistant has adopted it. Vicuna and wizardLM haven't.
Hopefully a consensus emerges in the next year.
@ehartford
I'm for ChatML.
The high-level interface is intuitive and easy to reason about and follow.
The low-level interface is similar to what Meta did with Llama-2's chat interface.
The tokens could probably be simplified though. Maybe the use of something more like markup would be an improvement?
<system>System prompt goes here</system>
<user>User prompt goes here</user>
<assistant>
And just "teach" the model that </tag> is always the stop token for that token sequence. Then the output could be parsed similarly to XML/HTML.
I'm still learning, so just take what I'm saying with a friendly grain of salt.
This is something I plan on experimenting with if I get the opportunity to do it in the future.
I agree though, a consensus would be nice.
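For what it's worth, output in that markup style could be pulled apart with a simple regex; a sketch under the assumption of well-formed, non-nested tags:

```python
import re
from typing import List, Tuple

# Extract (role, content) pairs from <role>...</role> style transcripts,
# as suggested above. Real model output would need more robust handling
# (unterminated tags, nesting, stray text between tags, etc.).
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def parse_markup_chat(text: str) -> List[Tuple[str, str]]:
    return [(m.group(1), m.group(2)) for m in TAG_RE.finditer(text)]
```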
What models actually use the current chat prompt template?
It seems most models use Alpaca's format:
<System prompt/Character Card>
### Instruction:
Your instruction or question here.
For roleplay purposes, I suggest the following - Write <CHAR NAME>'s next reply in a chat between <YOUR NAME> and <CHAR NAME>. Write a single reply only.
### Response:
The current scheme implemented in llama-cpp-python doesn't follow a convention I know of.
Please see the links in my original issue for a comprehensive and detailed list of the currently popular prompt templates.
90%+ of use cases will be covered if the following formats are supported:
The best source of documentation on these prompt formats is probably the model cards in TheBloke's distributions which are very well researched.
Hey @ehartford I just merged in the #711 PR, which adds a mechanism to specify common chat formats through a chat_format parameter on the Llama class and the server settings.
Currently supports:
Let me know if that works for you!
Nice!
ChatML would be lovely; it's garnering more support.
I've noted this as well and it's great to see support being added just now. But looking at the code, it seems as if there is room to be a bit more flexible and customizable.
For example, in LocalAI they allow people to add Yaml files with a configuration preset for each model. I really like their idea in general.
Maybe it would be an option for the future to have something similar. Instead of having everything fixed in the code, allow people to add a YAML file and pass its content into a format_from_yaml-like function.
Or better yet, accept a general lambda as an argument and implement the YAML idea as a specific lambda that can take a YAML file and template the response. E.g.,
chat_template_fn=partial(yaml_template_reader, yaml_path="...")
Anything besides yaml, please 🙏. Simple is always better.
@ehartford I'll add that and a few others I missed (mistral as well).
@r7l I'll consider this but likely as a utility for the server that converts a config file / template into a chat formatting function.
Thank you for your work on chat templates and llama-cpp-python generally!!
Curious if you could just piggyback on the Hugging Face template hub, or let users specify a tokenizer_config.json, to completely outsource this to the developing standard of rendering arbitrary Jinja provided with the model? I'd be surprised if new model releases don't all start coming with their own tokenizer config definition.
EDIT: Delivered the above in linked PR
@abetlen
Generation does not stop when using the ChatML prompt template. I think we need to add a stop token: stop_str = "<|im_end|>" and return ChatFormatterResponse(prompt=_prompt, stop=stop_str). That worked for me.
@register_chat_format("chatml")
def format_chatml(
messages: List[llama_types.ChatCompletionRequestMessage],
**kwargs: Any,
) -> ChatFormatterResponse:
system_template = """<|im_start|>system
{system_message}"""
system_message = _get_system_message(messages)
system_message = system_template.format(system_message=system_message)
_roles = dict(user="<|im_start|>user", assistant="<|im_start|>assistant")
_sep = "<|im_end|>"
stop_str = "<|im_end|>"
_messages = _map_roles(messages, _roles)
_messages.append((_roles["assistant"], None))
_prompt = _format_chatml(system_message, _messages, _sep)
return ChatFormatterResponse(prompt=_prompt, stop=stop_str)
A default stop token would be huge for realizing a transparent model provider.
There's a problem with using the stop tokens.
I'm not sure what the difference is yet, but I noticed that using the special tokens in the user facing templates causes a lot of issues.
I would advise not using special tokens at all with llama.cpp. In almost every test I conducted, the models started repeating themselves, derailing, and more.
Using the base template seems to work beautifully though. Not a single issue once I do that.
So you'd exclude them from the template, but still set the stop token as a default stop sequence item, right? That saves you from having to specify it in the payload. It will be needed to fully decouple the model from the chat consumer, leaving nothing model-specific in the payload other than maybe max tokens.
I noted it in my PR on L14.
Omitted for brevity [...]
Special tokens are crucial for the model's underlying operations, impacting pre-training, fine-tuning, and low-level inference processes. Users should avoid modifying special tokens to prevent issues in the model's output during inference. These issues may manifest as token fixation, repetitive language patterns, contextual derailment, and hallucinations. Improper use of separators and templates can exacerbate these problems.
Example using the llama-2 model and its templating schema:
1 <<SYS>>My name is Llama and I am a helpful assistant.<</SYS>>$
2 [INST] Hello Llama, my name is User. What's your name? [/INST]$
3 Hello User, my name is Llama. Nice to meet you!$
4 [INST] What can you do? [/INST]$
5 I can assist you with various tasks, including providing structured output for certain queries.$
6 [INST] How can you assist me in my programming projects? [/INST]$
7 $
This initial example is a proper template format that the model understands. It results in proper output and does not confuse the model.
1 <<SYS>>My name is Llama and I am a helpful assistant.<</SYS>>$
2 <s>[INST] Hello Llama, my name is User. What's your name? [/INST]$
3 Hello User, my name is Llama. Nice to meet you!</s>$
4 <s>[INST] What can you do? [/INST]$
5 I can assist you with various tasks, including providing structured output for certain queries.</s>$
6 <s>[INST] How can you assist me in my programming projects? [/INST]$
7 $
This example includes the use of special tokens, and the model may or may not use these tokens as a result. The model is not expecting them during inference, which causes unexpected behavior.
1 <<SYS>>My name is Llama and I am a helpful assistant.<</SYS>>$
2 $
3 <s>[INST] Hello Llama, my name is User. What's your name? [/INST]$
4 Hello User, my name is Llama. Nice to meet you!</s>$
5 $
6 <s>[INST] What can you do? [/INST]$
7 I can assist you with various tasks, including providing structured output for certain queries.</s>$
8 $
9 <s>[INST] How can you assist me in my programming projects? [/INST]$
10 $
This example is improperly formatted and causes the model to become confused. The model begins to fixate on tokens, uses language repetition, and eventually derails.
Note that the $ symbols are substitutes for newline characters (\n); they're part of the output of cat -A.
I propose we use a dedicated library for this: chatformat. The functionality to format chat prompts is not specific to this project. Creating a shared library will help other developers, but also help us by attracting contributions from outside. It's a win-win.
Additionally, what I'm missing with the current implementation is the possibility to "preface" the model's output. Prefacing means "putting words into the model's mouth".
The issue is that the current implementation seals off the last incomplete message:
USER: How do you feel?
ASSISTANT: I feel </s>
The </s> should not be there.
Chatformat leaves the last assistant message open by default.
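A sketch of that behavior in a generic formatter; this is my own illustration using Vicuna-style formatting, not chatformat's actual API:

```python
from typing import Dict, List

def format_vicuna(messages: List[Dict[str, str]]) -> str:
    # "Prefacing": when the final message is from the assistant, leave
    # it unterminated (no </s>) so the model continues it.
    out = []
    for i, msg in enumerate(messages):
        last = i == len(messages) - 1
        if msg["role"] == "user":
            out.append(f"USER: {msg['content']}\n")
        elif msg["role"] == "assistant":
            if last:
                out.append(f"ASSISTANT: {msg['content']}")  # left open
            else:
                out.append(f"ASSISTANT: {msg['content']}</s>\n")
        else:  # system
            out.append(msg["content"] + "\n")
    return "".join(out)
```

With the last assistant message left open, the model picks up mid-sentence from "I feel" instead of seeing a completed turn.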
We should probably use jinja templates, so the user can specify them at runtime if needed. Engines won't know which template to use; users may have models they just finished fine-tuning, with custom grammars, etc.
This will get everyone what they want.
@earonesty Good point about having custom templates. But I think using a templating engine is overcomplicating the matter. These chat formats generally consist of "rounds" that are stacked together.
A round is defined as
We can cover 99% of all possible formats by
So for example for Alpaca, the format can be defined as:
alpaca:
with_system: |-
{system}
### Instruction:
{user}
### Response:
{assistant}</s>
without_system: |-
### Instruction:
{user}
### Response:
{assistant}</s>
round_seperator: "\n\n"
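That round-based convention could be rendered with a few lines of Python; the spec is inlined here as a dict, and render_rounds is an illustrative name (note it formats complete rounds only, so generation would leave the final assistant slot to be handled separately):

```python
from typing import Dict, List, Tuple

# The Alpaca spec from the YAML above, inlined as a dict.
ALPACA = {
    "with_system": "{system}\n### Instruction:\n{user}\n### Response:\n{assistant}</s>",
    "without_system": "### Instruction:\n{user}\n### Response:\n{assistant}</s>",
    "round_seperator": "\n\n",
}

def render_rounds(
    fmt: Dict[str, str],
    system: str,
    rounds: List[Tuple[str, str]],
) -> str:
    # Each round is a (user, assistant) pair; only the first round
    # carries the system message.
    parts = []
    for i, (user, assistant) in enumerate(rounds):
        if i == 0 and system:
            parts.append(fmt["with_system"].format(system=system, user=user, assistant=assistant))
        else:
            parts.append(fmt["without_system"].format(user=user, assistant=assistant))
    return fmt["round_seperator"].join(parts)
```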
If you know of a format that is not covered by this convention, please comment.
@Mwni
If you know of a format that is not covered by this convention, please comment.
Llama-1, Llama-2, RedPajama, Mistral, Refact, etc...
jinja2 is incredibly easy and lightweight, and it's 100% compatible with what all current researchers are producing in their tokenizer configuration files.
@teleprint-me The following prompts were generated using the proposed scheme
Llama-2
<s>[INST] <<SYS>>
You are a very clever LLM.
<</SYS>>
Hello? [/INST] Hello.</s><s>[INST] What are you thinking? [/INST] I think that
Vicuna (and Mistral)
You are a very clever LLM.
USER: Hello?
ASSISTANT: Hello.</s>
USER: What are you thinking?
ASSISTANT: I think that
ChatML
<|im_start|>system
You are a very clever LLM.<|im_end|>
<|im_start|>user
Hello?<|im_end|>
<|im_start|>assistant
Hello.<|im_end|>
<|im_start|>user
What are you thinking?<|im_end|>
<|im_start|>assistant
I think that
Where's the problem?
The problem is that jinja2 is what sits in HF config files. So it's forward compatible with formats you haven't heard of yet, and the template can be embedded in gguf file metadata, so the user isn't on the hook to specify a template when working with gguf files. That forward-compatibility path matters.
Having it load right from the metadata would be killer.
@earonesty
Where in the specification is that? Also, ggerganov already stated he plans on using oblique templates and it will be a minimal, separate, implementation.
gguf allows you to store any metadata you want, and models on HF have jinja2 templates in their tokenizer configs, so the specification doesn't really matter that much. We can just add it to the convert script.
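As a concrete example of that approach, here's a hedged sketch rendering a ChatML-style template string with jinja2; the template text is simplified and only mirrors the style of chat_template fields found in tokenizer_config.json files, not any specific model's:

```python
from jinja2 import Template

# ChatML-style template in the style of HF tokenizer_config.json
# "chat_template" fields (simplified; real templates vary per model).
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

def render_chat(messages, add_generation_prompt=True):
    # Render the template with the message list; add_generation_prompt
    # appends the open assistant header for the model to complete.
    return Template(CHAT_TEMPLATE).render(
        messages=messages, add_generation_prompt=add_generation_prompt
    )
```

A loader would pull the template string from the model's tokenizer_config.json (or gguf metadata) instead of hard-coding it.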
So can I define a custom format like:
"<|start_header_id|>{name}<|end_header_id|>\n\n"
How? So far I have a dropdown box that selects the pre-defined formats.
Is your feature request related to a problem? Please describe.
When generating a chat completion, the prompt template is hard-coded to a non-standard format, and the system message is currently ignored:
https://github.com/abetlen/llama-cpp-python/blob/255d653ae3bd08c690dbf01f9533daf75f71217c/llama_cpp/llama.py#L1578
This mostly works for most models, but it's not correct.
Describe the solution you'd like
Describe alternatives you've considered
Modifying llama-cpp-python to hard-code it to the llama2-chat format; not a great solution.
Additional context