ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: server : make chat_example available through /props endpoint #8694

Open mgroeber9110 opened 1 month ago

mgroeber9110 commented 1 month ago

Feature Description

Currently there is no way of retrieving information about the recommended chat template for a model when using the /completion endpoint of the server. The idea of this feature is to add a new property chat_example to the /props endpoint that returns the same information that is already logged when the server starts up:

chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n"
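For illustration, a client could then query the proposed field roughly like this (a minimal sketch only: the chat_example field does not exist yet, and the base URL and response shape are assumptions):

```ts
// Sketch: assumes the proposed `chat_example` field has been added to /props.
// The server URL is a placeholder for wherever llama-server is listening.
async function fetchChatExample(baseUrl = "http://localhost:8080"): Promise<string> {
  const res = await fetch(`${baseUrl}/props`);
  const props = await res.json();
  // Hypothetical field: the pre-rendered example conversation that the server
  // already logs at startup.
  return props.chat_example;
}
```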

This goes in the same direction as the suggestion in #5447 (closed as "stale"), but now explicitly uses the ability to read templates from models and execute them with llama_chat_apply_template.

Motivation

This came up in #8196 as a way of also accessing the built-in templates from the server UI; a formatted chat is potentially easier to work with than a full jinja2 template, at least in common cases where the template is not too complex. This would enable a number of extensions to how templates are used with the /completion endpoint:

Possible Implementation

ngxson commented 1 month ago

You can already access chat_template from the /props endpoint: https://github.com/ggerganov/llama.cpp/pull/8337

IMO we can start by searching for a lightweight jinja parser in JS that can fit into the project. If we can't find one, then let's go with the chat_example approach.
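A rough sketch of what that client-side path could look like (the render callback is a placeholder for whichever JS Jinja engine gets picked, not a specific library's API; the chat_template field follows the /props change from #8337):

```ts
// Sketch: fetch the model's chat_template from /props and hand it, together
// with the conversation, to a client-side Jinja engine chosen by the caller.
type ChatMessage = { role: string; content: string };

async function buildPrompt(
  messages: ChatMessage[],
  render: (template: string, context: Record<string, unknown>) => string,
  baseUrl = "http://localhost:8080",
): Promise<string> {
  const props = await (await fetch(`${baseUrl}/props`)).json();
  // The context keys mirror what chat templates typically expect; the exact
  // set depends on the template and engine.
  return render(props.chat_template, { messages, add_generation_prompt: true });
}
```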

mgroeber9110 commented 1 month ago

I have done a bit of research regarding JS implementations of Jinja. So far, I am aware of two candidates:

I have tested several templates from HF models with the Online Demo of jinja-js, and so far found the following features to be unsupported (while being used in some templates):

Some of these should probably be easy to add, but of course we cannot guarantee full compatibility with Python expressions. Overall, I am getting about 50% of templates working out-of-the-box.

ngxson commented 1 month ago

> Overall, I am getting about 50% of templates working out-of-the-box.

IMO I'd prefer not to deliver a half-working version.

In any case, I don't see many real benefits to having a jinja parser in the llama.cpp server web UI, given that we already have the /chat/completions endpoint that can handle chat formatting in cpp code.
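For comparison, this is roughly what the existing server-side path looks like, where the chat template is applied in C++ and the client never touches Jinja (a sketch; the route and port depend on how llama-server is started):

```ts
// Sketch: with the OpenAI-compatible chat endpoint, the server formats the
// prompt from the model's chat template itself.
async function chat(messages: { role: string; content: string }[]): Promise<string> {
  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```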

There is already one implementation of such a web UI, but it is kind of a one-off (some people just want to push their code into the repo without thinking about long-term impact). It would be nice if we could somehow integrate it into the main UI: https://github.com/ggerganov/llama.cpp/tree/master/examples/server/public_simplechat

mgroeber9110 commented 1 month ago

Sounds good to me. Is there already a plan somewhere for what the server UI should look like long-term? I could see something like this:

Switching between the first two options could be a checkbox below "chat" mode.

Would it make sense to track such a plan (or a similar one) as a separate issue?