ggerganov / llama.cpp

LLM inference in C/C++

Server: possibility of customizable chat template? #5922

Closed ngxson closed 5 months ago

ngxson commented 8 months ago

Motivation

While we already have support for known chat templates, it is sometimes not enough for users who want to:

The problem is that other implementations of chat templates out there are also quite messy, for example:

Possible implementation

My idea is to have a simple JSON format that takes all roles into account:

{
  "system": {
    "prefix": "<|system|>\n",
    "postfix": "<|end|>\n"
  },
  "user": {
    "prefix": "<|user|>\n",
    "postfix": "<|end|>\n"
  },
  "assistant": {
    "prefix": "<|assistant|>\n",
    "postfix": "<|end|>\n"
  },
  "_stop": ["<|end|>"],
  "_generation": "<|assistant|>\n",
}

Users can specify a custom template via --chat-template-file ./my_template.json

The C++ code would be as simple as:

#include <sstream>
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Wrap each message in its role's prefix/postfix, then append the generation prompt.
std::string apply_custom_template(const json & messages, const json & tmpl) {
  std::stringstream ss;
  for (const auto & msg : messages) {
    const json & t = tmpl[msg["role"].get<std::string>()];
    ss << t["prefix"].get<std::string>()
       << msg["content"].get<std::string>()
       << t["postfix"].get<std::string>();
  }
  ss << tmpl["_generation"].get<std::string>(); // add generation prompt
  return ss.str();
}
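
For illustration, with the template above (a sketch, assuming tmpl holds the parsed JSON template), a short conversation renders like this:

json messages = json::array();
messages.push_back({ { "role", "system" }, { "content", "You are a helpful assistant." } });
messages.push_back({ { "role", "user"   }, { "content", "Hello!" } });

std::string prompt = apply_custom_template(messages, tmpl);
// prompt is now:
// "<|system|>\nYou are a helpful assistant.<|end|>\n"
// "<|user|>\nHello!<|end|>\n"
// "<|assistant|>\n"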

NOTE: This function does not take into account models that do not support a system prompt, but support for that can be added in the future, maybe toggled via a JSON attribute like "system_inside_user_message": true
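
For example, one possible interpretation of that toggle (just a sketch, hypothetical, reusing the includes and json alias from the snippet above): when the flag is set, fold the system content into the next user message instead of giving it its own block.

// Sketch only: not part of the proposal above, just one way the toggle could behave.
std::string apply_custom_template_no_system(const json & messages, const json & tmpl) {
  std::stringstream ss;
  std::string pending_system;
  for (const auto & msg : messages) {
    const std::string role    = msg["role"].get<std::string>();
    std::string       content = msg["content"].get<std::string>();
    if (role == "system" && tmpl.value("system_inside_user_message", false)) {
      pending_system = content + "\n"; // hold the system text until the next user turn
      continue;
    }
    if (role == "user" && !pending_system.empty()) {
      content = pending_system + content; // prepend system text to the user message
      pending_system.clear();
    }
    const json & t = tmpl[role];
    ss << t["prefix"].get<std::string>() << content << t["postfix"].get<std::string>();
  }
  ss << tmpl["_generation"].get<std::string>();
  return ss.str();
}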

Ref:

ggerganov commented 8 months ago

Users that want to support a certain template should open a PR and implement it in the framework that we already have

ngxson commented 8 months ago

Users that want to support a certain template should open a PR and implement it in the framework that we already have

Yeah, I thought that would be ideal, but sometimes that's not even enough: maybe a user wants to fine-tune and try out a single template?

Another problem is that there are currently models that don't have a Jinja template, for example the old Alpaca ### Instruction format (which I don't feel should exist in our code base; another problem is that we cannot support a stop word for this template). Yet some users are still using it.

The currently proposed solution is to use /completions with a custom proxy; I've already mentioned it in the wiki page.

My proposal in this issue is not complete, so I'll leave it here to see if anyone comes up with another use case (or another idea) that we've never thought about.

teleprint-me commented 8 months ago

Yeah, this is why I said templates should be the responsibility of the user.

It's why I always use the completions endpoint and avoid any chat template enforcement. The problem is simple to understand once you understand how a tokenizer is created and trained. ANY token can be used. ANY TOKEN. This ambiguity makes it problematic to require even a loose set of rules.

This problem will exist even if the industry agrees upon a standard. ChatML is nice, but it's only a single use case and doesn't solve the wider issue at hand. This also makes completions more valuable because they're the most flexible. Completions also set the foundation or stage for chat fine-tuning. It sucks that OpenAI took it down, but it's what I always use when I use llama.cpp.

Completions are literally the only reason I use llama.cpp. There's so much more flexibility that way. Just put the responsibility on the user, end of discussion. This isn't a conclusion I came to lightly. It took time, research, and experimentation to figure this out. This is why I backed and proposed this idea.

This is the most viable solution for the interim, and even then, this solution fails miserably with FIM and other templates. It's not that I think this is an impossible problem, but it is the type of problem that will create an entirely different set of problems that compound one another and eventually become a bottleneck in the best-case scenario.

I really do understand the temptation here, but it's best avoided.

ngxson commented 8 months ago

I really do understand the temptation here, but it's best avoided.

Thanks for your input. For clarification, I'm not saying that my proposal solves all the issues we have with chat templates in general. If I was that confident, I could have just made a PR instead.

I'm also not assuming that supporting custom chat templates is a good or bad idea. I'm still learning here. I understand your point; it's a valid reason not to have this feature, and I appreciate that.

However, I will still keep this issue open for a while to collect some feedback, which may be helpful if we change the decision in the future.

teleprint-me commented 8 months ago

I think this is the middle of the road solution which is good.

I just keep reiterating it because the tokens are dictated by the tokenizer and the settings used to train the tokenizer. Then the chat template is fine-tuned with any added tokens.

All of the tokens for the model are (usually, but unfortunately not always) in there, e.g. tokenizer.model or tokenizer.json. It was really interesting and fun learning how this worked.

So, to clarify, I support your proposal. Once the tokenizer is understood, managing the templates for the model becomes more intuitive.

A really great way to get a feel for it is with completions.

ngxson commented 8 months ago

I came across https://github.com/ollama/ollama/issues/1977 and feel like we're in the middle of a "war of templates". You're right @teleprint-me, there's temptation, but it's better to avoid it, at least at this stage.

Edit: Still, I'm feeling quite lucky because in llama.cpp we have test-chat-template.cpp to make sure that each template works 100% the same as the original Jinja version.

teleprint-me commented 8 months ago

I'd love to have it automated; it would be great. I forget where I stated it, but I remember reiterating that this is similar to "Hilbert's paradox of the Grand Hotel", which "is a thought experiment which illustrates a counterintuitive property of infinite sets".

This issue arises because of the desire to support many models with a variety of templates. Model developers can choose to set up the template however they'd like and so can fine-tuners.

The moment you begin baking in special tokens, chat templates, and more is the moment you've bound yourself to an uncanny solution that becomes exponentially more difficult to manage over time. You'll always need to accommodate another "guest".

The simplest solution is to create an API or framework that developers can plug in to. @ggerganov actually suggested this same solution a while ago. I recommended this solution multiple times. I've been advocating for placing the chat template under the responsibility of the user. My rationale is to keep the code and API simple and digestible.

I'm confident that there is a way to find a middle ground, but we'll need to work towards it. I think your idea is actually sound, and the reason is that it's simple and flexible. The motto around here seems to be not to over-engineer, but supporting chat templates will require more than a little over-engineering, and that doesn't include the maintenance that will ensue as a result. It has technical debt written all over it.

I think using the prefix and postfix for prompts is probably the best we can do until templates become solidified. It's still early and we're just getting started. It's better to observe and learn as we progress. Once a pattern emerges, we can use that as an anchor.

teleprint-me commented 8 months ago

As an aside, I'd love to build a custom tokenizer for llama.cpp. I think it would be great. We could use it for training and fine-tuning. I haven't looked at the backend lately, but back-propagation would obviously help for updating the weights. What would be really neat is training and fine-tuning quants. If I remember correctly, the model outputs logits, the softmax turns them into probabilities, and the backward pass updates the weights using cross-entropy loss (feel free to correct me, I'm still learning). Now that would be really cool :)

kaizau commented 7 months ago

Re: https://github.com/ggerganov/llama.cpp/issues/6726#issuecomment-2065729990

@ngxson The main problem is that even with this level of flexibility, some templates can't be supported without doing some code logic (for example, the llama 2 template's [INST] with the <<SYS>> system message).

My hunch is that code logic in templates can still be avoided, if the configuration provides enough flexibility. For example, providing an alternate template based on message index:

{
  "system": {
    "prefix": "<s>[INST] <<SYS>>\n",
    "postfix": "\n<<SYS>>\n\n"
  },
  "user_1": {
    "prefix": "",
    "postfix": " [/INST] "
  },
  "user": {
    "prefix": "<s>[INST] ",
    "postfix": " [/INST] "
  },
  "assistant": {
    "prefix": "",
    "postfix": " </s>"
  }
}

Or more generally:

{
  ...
  "user": {
    "prefix": "<s>[INST] ",
    "postfix": " [/INST] "
    "override": {
      "1": {
        "prefix": ""
      }
    }
  },
  ...
}

Though I wonder if a one-off workaround for the first user message might even be enough.
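
For what it's worth, resolving such an index-based override on the server side could be a small lookup (a hypothetical sketch on top of the prefix/postfix proposal; the helper name is made up):

#include <string>
#include <utility>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Sketch: pick the prefix/postfix for a role, letting an "override" entry
// keyed by 1-based message index replace individual fields.
std::pair<std::string, std::string> resolve_affixes(const json & tmpl,
                                                    const std::string & role,
                                                    size_t msg_index_1based) {
  const json & t   = tmpl.at(role);
  std::string pre  = t.value("prefix",  "");
  std::string post = t.value("postfix", "");

  if (t.contains("override")) {
    const std::string key = std::to_string(msg_index_1based);
    if (t["override"].contains(key)) {
      const json & o = t["override"][key];
      pre  = o.value("prefix",  pre);
      post = o.value("postfix", post);
    }
  }
  return { pre, post };
}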

kaizau commented 7 months ago

Re: https://github.com/ggerganov/llama.cpp/issues/6726#issuecomment-2065440493

@bullno1 Sounds cool, and I'd say take it further: why even template or search & replace within a role? Just change it to "prefix" and "suffix":

Might just be me, but I slightly prefer the aesthetic / concise legibility of seeing the entire message in context:

{
  "system": "<s>[INST] <<SYS>>\n{{content}}<<SYS>>\n\n",
  "user_1": "{{content}} [/INST] ",
  "user": "<s>[INST] {{content}} [/INST] ",
  "assistant": "{{content}} </s>"
}

Not a big deal, because both are more legible than a string of escaped Jinja.

As for injection risk, this wouldn't need to execute code — just do a string replacement. Maybe I'm overlooking something here?
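
For concreteness, the replacement could be as small as this (a sketch; it assumes the {{content}} placeholder appears at most once per role template):

#include <string>

// Sketch: render one message with the "{{content}}" style of template.
std::string render_message(const std::string & role_tmpl, const std::string & content) {
  const std::string placeholder = "{{content}}";
  std::string out = role_tmpl;
  const size_t pos = out.find(placeholder);
  if (pos != std::string::npos) {
    out.replace(pos, placeholder.size(), content);
  }
  return out;
}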

kaizau commented 7 months ago

@teleprint-me It seems like chat templates brush up against a macro question around the ideal scope of llama.cpp and of this server "example" in general.

But whether the chat endpoint lives here or elsewhere, configurable chat templates are a very natural extension. Anything else is an incomplete solution that forces clients (or users) to implement their own workarounds on top of completions if they want to support the vast and growing number of templates.

ngxson commented 7 months ago

Might just be me, but I slightly prefer the aesthetic / concise legibility of seeing the entire message in context

@kaizau I personally prefer having prefix/postfix explicit, since it makes the C++ code more readable. I think the format you proposed is more suitable for higher-level programming languages, where the parser can be just one or two lines of code.

teleprint-me commented 7 months ago

@kaizau I agree with your assessment.

bullno1 commented 7 months ago

Re: #6726 (comment)

As for injection risk, this wouldn't need to execute code — just do a string replacement. Maybe I'm overlooking something here?

The injection risk I was talking about is more about user input containing special tokens like <|start_of_turn|>, <|eot_id|>, ... When we tokenize the prefix/suffix markers separately, the user message can be tokenized with parse_special=false, so those special tags appear as literal strings instead of special tokens.

When everything is templated into a single string, parse_special has to be true for the whole string, and then it's easy to put words into the other role's mouth.
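
To make the distinction concrete, here is a rough sketch of per-part tokenization (it assumes the llama_tokenize helper from common.h with (ctx, text, add_special, parse_special) parameters; exact names may differ across versions):

#include <string>
#include <vector>
#include "common.h" // assumed: provides llama_tokenize(ctx, text, add_special, parse_special)

// Sketch: build one chat turn token-by-token so user text cannot smuggle in
// special tokens. Role markers are parsed as special tokens, user content is not.
static std::vector<llama_token> build_turn(llama_context * ctx,
                                           const std::string & prefix,
                                           const std::string & user_text,
                                           const std::string & postfix) {
  std::vector<llama_token> out;
  auto append = [&](const std::string & text, bool parse_special) {
    std::vector<llama_token> toks = llama_tokenize(ctx, text, /*add_special=*/false, parse_special);
    out.insert(out.end(), toks.begin(), toks.end());
  };
  append(prefix,    /*parse_special=*/true);   // e.g. "<|user|>\n"  -> special tokens
  append(user_text, /*parse_special=*/false);  // "<|eot_id|>" here stays literal text
  append(postfix,   /*parse_special=*/true);   // e.g. "<|end|>\n"   -> special tokens
  return out;
}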

hanishkvc commented 7 months ago

Please do have a look at the code in the PR below. Around the time llama 3 came out, I needed to look at llama.cpp, and in turn I worked on the PR below to see whether one can have a generic flow, driven by a config file, that accommodates different modes/chat-handshake-template standards in a flexible way. The idea is that if a new template standard is added during fine-tuning of a model, or if a new model or standard comes out that follows a sane convention matching the commonality I have noticed across many models/standards, then the generic code flow itself can be used by just updating the config file, without having to add a custom template block.

This in turn can be used by examples/main, examples/server, as well as other users of the llama.cpp library. Currently, main has been patched to use this config-file-based flow, piggybacking to a great extent on its existing interactive mode and its in-prefix, in-suffix, and antiprompt.

https://github.com/ggerganov/llama.cpp/pull/6834

Based on some minimal testing at my end, I seem to be able to handle the nitty-gritty of around 8(+1) models using this generic code + config file based flow.

Currently JSON is used for the config file, but if needed it can be switched to a simpler text-based config format, to avoid users of the llama.cpp library having to depend on a JSON library.

The generic code flow uses a concept similar to what this issue proposes, i.e. a generic code flow driven by a config file.

The generic flow additionally takes care of

You can look at examples/chaton_meta.json, which has entries for the 8(+1) models/standards that I have tested with my patch.

JoakimCh commented 6 months ago

teleprint-me: Yeah, this is why I said templates should be the responsibility of the user.

Agreed, which is what I asked about in issues/6982.

As ngxson pointed out, "the code is so simple" that we can write it ourselves in whatever frontend we use.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.