ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: server : make chat_example available through /props endpoint #8694

Open mgroeber9110 opened 1 month ago

mgroeber9110 commented 1 month ago

Feature Description

Currently there is no way of retrieving information about the recommended chat template for a model when using the /completion endpoint of the server. The idea of this feature is to add a new property chat_example to the /props endpoint that returns the same information that is already logged when the server starts up:

chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n"
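For illustration, a client could then query the proposed field roughly like this (a minimal sketch only: the chat_example field does not exist yet, and the base URL and response shape are assumptions):

```ts
// Sketch: assumes the proposed `chat_example` field has been added to /props.
// The server URL is a placeholder for wherever llama-server is listening.
async function fetchChatExample(baseUrl = "http://localhost:8080"): Promise<string> {
  const res = await fetch(`${baseUrl}/props`);
  const props = await res.json();
  // Hypothetical field: the pre-rendered example conversation that the server
  // already logs at startup.
  return props.chat_example;
}
```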

This goes in the same direction as the suggestion in #5447 (closed as "stale"), but now explicitly uses the ability to read templates from models and execute them with llama_chat_apply_template.

Motivation

This came up in #8196 as a way of also accessing the built-in templates from the server UI; a formatted chat is potentially easier to work with than a full jinja2 template, at least in common cases where the template is not too complex. This would enable a number of extensions to how templates are used with the /completion endpoint:

Possible Implementation

ngxson commented 1 month ago

You can already access chat_template from the /props endpoint: https://github.com/ggerganov/llama.cpp/pull/8337

IMO we can start by searching for a lightweight jinja parser in JS that can fit into the project. If we can't find one, then let's go with the chat_example approach.
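A rough sketch of what that client-side path could look like (the render callback is a placeholder for whichever JS Jinja engine gets picked, not a specific library's API; the chat_template field follows the /props change from #8337):

```ts
// Sketch: fetch the model's chat_template from /props and hand it, together
// with the conversation, to a client-side Jinja engine chosen by the caller.
type ChatMessage = { role: string; content: string };

async function buildPrompt(
  messages: ChatMessage[],
  render: (template: string, context: Record<string, unknown>) => string,
  baseUrl = "http://localhost:8080",
): Promise<string> {
  const props = await (await fetch(`${baseUrl}/props`)).json();
  // The context keys mirror what chat templates typically expect; the exact
  // set depends on the template and engine.
  return render(props.chat_template, { messages, add_generation_prompt: true });
}
```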

mgroeber9110 commented 1 month ago

I have done a bit of research regarding JS implementations of Jinja. So far, I am aware of two candidates:

I have tested several templates from HF models with the Online Demo of jinja-js, and so far found the following features to be unsupported (while being used in some templates):

Some of these should probably be easy to add, but of course we cannot guarantee full compatibility with Python expressions. Overall, I am getting about 50% of templates working out-of-the-box.

ngxson commented 1 month ago

> Overall, I am getting about 50% of templates working out-of-the-box.

IMO I'd prefer not to deliver a half-working version.

In any case, I don't see many real benefits to having a jinja parser in the llama.cpp server web UI, given that we already have the /chat/completions endpoint that can handle chat formatting in cpp code.
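For comparison, this is roughly what the existing server-side path looks like, where the chat template is applied in C++ and the client never touches Jinja (a sketch; the route and port depend on how llama-server is started):

```ts
// Sketch: with the OpenAI-compatible chat endpoint, the server formats the
// prompt from the model's chat template itself.
async function chat(messages: { role: string; content: string }[]): Promise<string> {
  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```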

There is already one implementation of such a web UI, but it is kind of a one-off (some people just want to push their code into the repo without thinking about long-term impact). It would be nice if we could somehow integrate it into the main UI: https://github.com/ggerganov/llama.cpp/tree/master/examples/server/public_simplechat

mgroeber9110 commented 1 month ago

Sounds good to me. Is there already a plan somewhere for what the server UI should look like long-term? I could see something like this:

Switching between the first two options could be a checkbox below "chat" mode.

Would it make sense to track such a plan (or a similar one) as a separate issue?