Users that want to support a certain template should open a PR and implement it in the framework that we already have
Yeah, I thought that would be the ideal way, but sometimes that's not even enough: maybe a user wants to fine-tune and try out one single template?
Another problem is that there are currently models that don't have a Jinja template, for example the old Alpaca ### Instruction format (which I don't feel should exist in our code base; another problem is that we cannot support a stop word for this template). Yet some users are still using it.
The currently proposed solution is to use /completions with a custom proxy; I've already mentioned it in the wiki page.
My proposal here is not complete, so I'll leave it open to see if anyone comes up with another use case (or another idea) that we've never thought about.
Yeah, this is why I said templates should be the responsibility of the user.
It's why I always use the completions endpoint and avoid any chat template enforcement. The problem is simple to understand once you understand how a tokenizer is created and trained. ANY token can be used. ANY TOKEN. This ambiguity is problematic for requiring even a loose set of rules.
This problem will exist even if the industry agrees upon a standard. ChatML is nice, but it's only a single use case and doesn't solve the wider issue at hand. This also makes completions more valuable because they're the most flexible. Completions also set the foundation or stage for chat fine-tuning. It sucks that OpenAI took it down, but it's what I always use when I use llama.cpp.
Completions are literally the only reason I use llama.cpp. There's so much more flexibility that way. Just put the responsibility on the user, end of discussion. This isn't a conclusion I came to lightly. It took time, research, and experimentation to figure this out. This is why I backed and proposed this idea.
This is the most viable solution for the interim, and even then, this solution fails miserably with FIM and other templates. It's not that I think this is an impossible problem, but it is the type of problem that will create an entirely different set of problems that compound one another and eventually become a bottleneck in the best-case scenario.
I really do understand the temptation here, but it's best avoided.
Thanks for your input. For clarification, I'm not saying that my proposal solves all the issues we have with chat templates in general. If I were that confident, I could have just made a PR instead.
I'm also not assuming that supporting custom chat templates is a good idea or a bad one; I'm still learning here. I understand your point. It's a valid reason not to have this feature, and I appreciate that.
However, I will still keep this issue open for a while to collect some feedback; it may be helpful if we change the decision in the future.
I think this is a middle-of-the-road solution, which is good.
I just keep reiterating it because the tokens are dictated by the tokenizer and the settings used to train the tokenizer. Then the chat template is fine-tuned with any added tokens.
All of the tokens for the model are (usually, but unfortunately not always) in there, e.g. tokenizer.model or tokenizer.json. It was really interesting and fun learning how this worked.
So, to clarify, I support your proposal. Once the tokenizer is understood, managing the templates for the model becomes more intuitive.
A really great way to get a feel for it is with completions.
I came across https://github.com/ollama/ollama/issues/1977 and I feel like we're in the middle of a "war of templates". You're right @teleprint-me, there's temptation, but it's better to avoid it, at least at this stage.
Edit: Still, I feel quite lucky because in llama.cpp we have test-chat-template.cpp to make sure that each template works exactly the same as the original Jinja version.
I'd love to have it automated; that would be great. I forget where I stated it, but I remember reiterating that this is similar to "Hilbert's paradox of the Grand Hotel", which "is a thought experiment which illustrates a counterintuitive property of infinite sets".
This issue arises because of the desire to support many models with a variety of templates. Model developers can choose to set up the template however they'd like and so can fine-tuners.
The moment you begin baking in special tokens, chat templates, and more, is the moment you've bound yourself to an uncanny solution that becomes exponentially more difficult to manage over time. You'll always need to accommodate another "guest".
The simplest solution is to create an API or framework that developers can plug into. @ggerganov actually suggested this same solution a while ago. I recommended this solution multiple times. I've been advocating to place the chat template as the responsibility of the user. My rationale is to keep the code and API simple and digestible.
I'm confident that there is a way to find a middle ground, but we'll need to work towards that middle ground. I think your idea is actually sound, and the reason is that it's simple and flexible. The motto around here seems to be to not over-engineer, but supporting chat templates will require much more than over-engineering, and this doesn't include the maintenance that will ensue as a result. It has technical debt written all over it.
I think using the prefix and postfix for prompts is probably the best we can do until templates become solidified. It's still early and we're just getting started. It's better to observe and learn as we progress. Once a pattern emerges, we can use that as an anchor.
As an aside, I'd love to build a custom tokenizer for llama.cpp. I think it would be great. We could use it for training and fine-tuning. I haven't looked at the backend lately, but back-propagation would obviously help for updating the weights. What would be really neat is training and fine-tuning quants. If I remember correctly, the softmax converts the logits into probabilities and the backward pass is driven by the cross-entropy loss (feel free to correct me, I'm still learning). Now that would be really cool :)
Re: https://github.com/ggerganov/llama.cpp/issues/6726#issuecomment-2065729990
@ngxson The main problem is that even with this level of flexibility, some templates can't be supported without some code logic (for example, the llama 2 template with [INST] and a <<SYS>> system message).
My hunch is that code logic in templates can still be avoided, if the configuration provides enough flexibility. For example, providing an alternate template based on message index:
{
"system": {
"prefix": "<s>[INST] <<SYS>>\n",
"postfix": "\n<<SYS>>\n\n"
},
"user_1": {
"prefix": "",
"postfix": " [/INST] "
},
"user": {
"prefix": "<s>[INST] ",
"postfix": " [/INST] "
},
"assistant": {
"prefix": "",
"postfix": " </s>"
}
}
Or more generally:
{
...
"user": {
"prefix": "<s>[INST] ",
"postfix": " [/INST] "
"override": {
"1": {
"prefix": ""
}
}
},
...
}
Though I wonder if a 1-off workaround for first user message might even be enough.
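To make the index-based override concrete, here is a minimal C++ sketch of how it could be resolved; the struct and function names are purely illustrative, nothing like this exists in llama.cpp today:

```cpp
#include <map>
#include <optional>
#include <string>

// Illustrative config types mirroring the JSON above.
struct affix_override {
    std::optional<std::string> prefix;
    std::optional<std::string> postfix;
};

struct role_template {
    std::string prefix;
    std::string postfix;
    std::map<int, affix_override> overrides; // keyed by 1-based message index within the role
};

// Wrap the i-th message of a role, honoring an index-specific override if present.
static std::string wrap_message(const role_template & tmpl, int index_in_role,
                                const std::string & content) {
    std::string prefix  = tmpl.prefix;
    std::string postfix = tmpl.postfix;
    const auto it = tmpl.overrides.find(index_in_role);
    if (it != tmpl.overrides.end()) {
        if (it->second.prefix)  { prefix  = *it->second.prefix;  }
        if (it->second.postfix) { postfix = *it->second.postfix; }
    }
    return prefix + content + postfix;
}
```

With the llama 2 config above, the first user turn would resolve to an empty prefix, since the system prefix has already opened the [INST] block.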
Re: https://github.com/ggerganov/llama.cpp/issues/6726#issuecomment-2065440493
@bullno1 Sounds cool and I'd say take it further, why even template or search & replace within a role? Just change it to "prefix" and "suffix":
Might just be me, but I slightly prefer the aesthetic / concise legibility of seeing the entire message in context:
{
"system": "<s>[INST] <<SYS>>\n{{content}}<<SYS>>\n\n",
"user_1": "{{content}} [/INST] ",
"user": "<s>[INST] {{content}} [/INST] ",
"assistant": "{{content}} </s>"
}
Not a big deal, because both are more legible than a string of escaped Jinja.
As for injection risk, this wouldn't need to execute code — just do a string replacement. Maybe I'm overlooking something here?
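For what it's worth, even in C++ the replacement itself stays tiny. A rough sketch, assuming the {{content}} placeholder from the example above:

```cpp
#include <string>

// Substitute the {{content}} placeholder in a per-role template string.
// Pure string replacement, no code execution (unlike a full Jinja engine).
static std::string render_role(const std::string & tmpl, const std::string & content) {
    static const std::string placeholder = "{{content}}";
    std::string out = tmpl;
    const size_t pos = out.find(placeholder);
    if (pos != std::string::npos) {
        out.replace(pos, placeholder.size(), content);
    }
    return out;
}
```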
@teleprint-me Seems like chat templates brush up against a macro question around the ideal scope of llama.cpp and this server "example" in general.
But whether the chat endpoint is here or elsewhere, configurable chat templates are a very natural extension. Anything else is an incomplete solution that forces clients (or users) to implement their own workarounds on completions if they want to support the vast and growing number of templates.
Might just be me, but I slightly prefer the aesthetic / concise legibility of seeing the entire message in context
@kaizau I personally prefer having postfix/prefix explicitly, since it makes the cpp code more readable. I think the format you proposed is more suitable for higher-level programming languages, since the parser can be just one or two lines of code.
@kaizau I agree with your assessment.
Re: #6726 (comment)
@bullno1 Sounds cool and I'd say take it further, why even template or search & replace within a role? Just change it to "prefix" and "suffix":
Might just be me, but I slightly prefer the aesthetic / concise legibility of seeing the entire message in context:
{ "system": "<s>[INST] <<SYS>>\n{{content}}<<SYS>>\n\n", "user_1": "{{content}} [/INST] ", "user": "<s>[INST] {{content}} [/INST] ", "assistant": "{{content}} </s>" }
Not a big deal, because both are more legible than a string of escaped Jinja.
As for injection risk, this wouldn't need to execute code — just do a string replacement. Maybe I'm overlooking something here?
The injection risk I was talking about is more about user input containing special tokens like <|start_of_turn|>, <|eot_id|>, ...
When we tokenize the prefix/suffix markers separately, the user message can be tokenized with parse_special=false, and those special tags will appear as literal strings instead of special tokens.
When everything is templated into a single string, parse_special has to be true, and it's easy to put words into the other role's mouth.
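For illustration, a sketch of the two-step tokenization that avoids this, assuming the llama_tokenize() C API variant that takes add_special and parse_special flags (adapt to the actual header in use):

```cpp
#include <string>
#include <vector>
#include "llama.h"

// Tokenize one piece of text, optionally allowing special tokens to be parsed.
static std::vector<llama_token> tokenize_piece(const llama_model * model,
                                               const std::string & text,
                                               bool parse_special) {
    std::vector<llama_token> toks(text.size() + 8);
    int n = llama_tokenize(model, text.c_str(), (int) text.size(),
                           toks.data(), (int) toks.size(),
                           /*add_special=*/false, parse_special);
    if (n < 0) { // buffer too small: resize to the required count and retry
        toks.resize(-n);
        n = llama_tokenize(model, text.c_str(), (int) text.size(),
                           toks.data(), (int) toks.size(),
                           /*add_special=*/false, parse_special);
    }
    toks.resize(n > 0 ? n : 0);
    return toks;
}

// Build one turn so that only the trusted template markers can become special tokens:
// a "<|eot_id|>" typed by the user stays a literal string instead of ending the turn.
static std::vector<llama_token> build_turn(const llama_model * model,
                                           const std::string & prefix,
                                           const std::string & user_text,
                                           const std::string & suffix) {
    std::vector<llama_token> out;
    for (const auto & piece : {
            tokenize_piece(model, prefix,    /*parse_special=*/true),     // trusted
            tokenize_piece(model, user_text, /*parse_special=*/false),    // untrusted
            tokenize_piece(model, suffix,    /*parse_special=*/true) }) { // trusted
        out.insert(out.end(), piece.begin(), piece.end());
    }
    return out;
}
```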
Please have a look at the code in the PR below. Around the time llama3 came out, I needed to look at llama.cpp, and in turn I worked on the PR below to see if one can have a generic flow driven by a config file, to accommodate different modes / chat-handshake-template standards in a generic and flexible way. The idea is that if a new template standard is added during fine-tuning of a model, or if a new model or standard comes out that follows a sane convention matching the commonality I have noticed across many models/standards, then the generic code flow itself can be used by just updating the config file, without having to add a custom template block.
This in turn can be used by example/main, example/server, as well as other users of the llama.cpp library. Currently, main has been patched to use this config-file-based flow, piggybacking to a great extent on its existing interactive mode and its in-prefix, in-suffix and antiprompt options.
https://github.com/ggerganov/llama.cpp/pull/6834
Based on some minimal testing at my end, I seem to be able to handle the nitty-gritty of around 8 (+1) models using this generic code + config-file-based flow.
Currently JSON is used for the config file, but if needed it can be switched to a simpler text-based config file, so that users of the llama.cpp library don't need to depend on a JSON library.
The generic code flow uses a concept similar to what is proposed here, i.e. a generic code flow driven by a config file.
The generic flow additionally takes care of:
- the conditionality seen across a few different models with respect to tagging the system message + first user message, by using related generic flags and by how system-suffix, system-end, user-begin and user-prefix are set up;
- the need to differentiate between role-begin, role-prefix, role-suffix and role-end tokens for each role, and in turn the variation in whether or not they are inserted across different models, handled in a simple and generic way by allowing each of these to be set to a specific string or left as an empty string for each role, as needed by that specific model.
You can look at examples/chaton_meta.json, which has entries for the 8 (+1) models/standards that I have tested with my patch.
teleprint-me: Yeah, this is why I said templates should be the responsibility of the user.
Agreed, which is what I asked about here: issues/6982.
As ngxson pointed out, "the code is so simple" that we can write it ourselves in whatever frontend we use.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Motivation
While we already have support for known chat templates, it is sometimes not enough for users who want to go beyond them, for example to fine-tune and try out their own template.
The problem is that other implementations of chat templates out there are also quite messy. For example, templates typically assume only the roles system-user-assistant, but technically it's possible to have custom roles like database, function, search-engine, ...

Possible implementation
My idea is to have a simple JSON format that takes all roles into account.
The user can specify the custom template via --chat-template-file ./my_template.json
The cpp code will be as simple as:
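(The snippet below is an illustrative sketch rather than the exact code; the type names are assumptions.)

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative types: one prefix/postfix pair per role, loaded from the JSON file.
struct chat_affix { std::string prefix, postfix; };
struct chat_msg   { std::string role,   content; };

// Render the conversation by wrapping each message with its role's prefix/postfix.
static std::string apply_custom_template(const std::map<std::string, chat_affix> & tmpl,
                                         const std::vector<chat_msg> & messages) {
    std::string prompt;
    for (const auto & msg : messages) {
        const auto it = tmpl.find(msg.role);
        if (it == tmpl.end()) {
            continue; // unknown role: skip, or fall back to a default entry
        }
        prompt += it->second.prefix + msg.content + it->second.postfix;
    }
    return prompt;
}
```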
NOTE: This function does not take into account models that do not support a system prompt for now, but that can be added in the future, maybe toggled via an attribute inside the JSON such as "system_inside_user_message": true
Ref: