abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

One chat prompt/template should be customizable from runtime - the prompt I need atm #816

Open pervrosen opened 11 months ago

pervrosen commented 11 months ago

Is your feature request related to a problem? Please describe. Given the rapid evolution of LLMs and the variety of prompt formats they use, we need a way to specify a prompt template at runtime, not only in code, e.g. mistral -> mistral+orca -> mistral-zephyr within a week.

Describe the solution you'd like An endpoint to override a prompt template, or add one at runtime, for the model I am currently evaluating.

Describe alternatives you've considered Submitting a pull request for each and every LLM I encounter, OR pulling the chat template for the model in question via the Hugging Face feature described at huggingface.co/docs/transformers/main/en/chat_templating, but that introduces one more dependency.

Additional context An example: I run everything in a Docker/k8s setup. I call an endpoint to specify the prompt template, if one is not already specified in code. That populates a runtime version of

```python
@register_chat_format("mistral")
def format_mistral(
    messages: List[llama_types.ChatCompletionRequestMessage],
    **kwargs: Any,
) -> ChatFormatterResponse:
    _roles = dict(user="[INST] ", assistant="[/INST]")
    _sep = " "
    system_template = """{system_message}"""
    system_message = _get_system_message(messages)
    system_message = system_template.format(system_message=system_message)
    _messages = _map_roles(messages, _roles)
    _messages.append((_roles["assistant"], None))
    _prompt = _format_no_colon_single(system_message, _messages, _sep)
    return ChatFormatterResponse(prompt=_prompt)
```

that I can use for calling the model.
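For illustration, here is a rough sketch of what such a runtime registration endpoint could look like, reusing the existing register_chat_format decorator from llama_cpp.llama_chat_format. The /v1/internal/chat_format route, the ChatFormatOverride model, and all of its field names are invented for this example, and it glosses over how a running Llama instance would switch to the newly registered format.

```python
# Hypothetical sketch only: this endpoint does not exist in llama-cpp-python.
from typing import Any, List

from fastapi import FastAPI
from pydantic import BaseModel

from llama_cpp import llama_types
from llama_cpp.llama_chat_format import ChatFormatterResponse, register_chat_format

app = FastAPI()


class ChatFormatOverride(BaseModel):
    name: str                    # e.g. "mistral-zephyr"; should be a new, unused name
    user_prefix: str = "[INST] "
    assistant_prefix: str = "[/INST]"
    separator: str = " "


@app.post("/v1/internal/chat_format")  # invented route, for illustration only
def set_chat_format(override: ChatFormatOverride) -> dict:
    """Register a chat format while the server is running."""

    @register_chat_format(override.name)
    def format_runtime(
        messages: List[llama_types.ChatCompletionRequestMessage],
        **kwargs: Any,
    ) -> ChatFormatterResponse:
        # Deliberately simplified: prefix each message with its role marker and
        # leave the assistant turn open for generation.
        parts = []
        for message in messages:
            prefix = (
                override.assistant_prefix
                if message["role"] == "assistant"
                else override.user_prefix
            )
            parts.append(prefix + (message.get("content") or ""))
        parts.append(override.assistant_prefix)
        return ChatFormatterResponse(prompt=override.separator.join(parts))

    return {"registered": override.name}
```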

tranhoangnguyen03 commented 11 months ago

What if multiple users are using the server? Does it mean everyone gets to modify the same shared runtime prompt template, or does the server have to keep a different prompt template for each user?

pervrosen commented 11 months ago

If that is a problem, perhaps one could limit usage with a feature flag. That would make sense especially since I suspect that by the time a model hits production, a static version of the template would have been implemented anyway. Do you agree?

tranhoangnguyen03 commented 11 months ago

Looks like you can build upon or contribute to this similar effort: https://github.com/abetlen/llama-cpp-python/pull/809

tranhoangnguyen03 commented 11 months ago

@pervrosen do you know what the behavior of --chat_format is on the /v1/completion endpoint right now? It defaults to "llama-2". Does that mean every prompt sent to the endpoint automatically has the llama-2 prompt format applied?

pervrosen commented 11 months ago

If you search for chat_format in the repo you will find that prompt templates are implemented for some 8-10 models. I hope the attached image shows what I mean. Rgs

[image attachment]
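For anyone following along, this is roughly how one of those registered formats is selected from the Python API; the model path below is just a placeholder:

```python
from llama_cpp import Llama

# chat_format selects one of the formatters registered in llama_chat_format.py.
llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    chat_format="llama-2",
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What does a chat template do?"},
    ],
)
print(response["choices"][0]["message"]["content"])
```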

earonesty commented 10 months ago

feels like the "chat_format" parameter could just be a Jinja template. then just execute it; the user can specify whatever they want.

for example, this is "mistral openorca":

"{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<\|im_start\|>' + message['role'] + '\n' + message['content'] + '<\|im_end\|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<\|im_start\|>assistant\n' }}{% endif %}",

if everyone is cool with that i will add it (i will detect a Jinja template passed as the chat format param and execute it, while still supporting the python registered formats as-is)

then we can also add the template to the GGUF file metadata (absorbing it during convert.py), and be done with having to think about templates!
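As a sketch of that idea (not part of llama-cpp-python; the helper names are made up for illustration), detecting and rendering a Jinja chat_format with the standard jinja2 package could look like this:

```python
# Sketch only: if the chat_format value looks like a Jinja template, render it
# with jinja2 instead of looking up a registered Python formatter.
from typing import Dict, List

from jinja2 import Environment

MISTRAL_OPENORCA_TEMPLATE = (
    "{% if not add_generation_prompt is defined %}"
    "{% set add_generation_prompt = false %}{% endif %}"
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)


def looks_like_jinja(chat_format: str) -> bool:
    # Crude heuristic: registered format names never contain Jinja delimiters.
    return "{{" in chat_format or "{%" in chat_format


def render_chat_prompt(chat_format: str, messages: List[Dict[str, str]]) -> str:
    template = Environment().from_string(chat_format)
    return template.render(messages=messages, add_generation_prompt=True)


if __name__ == "__main__":
    messages = [
        {"role": "system", "content": "You are MistralOrca."},
        {"role": "user", "content": "Hello!"},
    ]
    if looks_like_jinja(MISTRAL_OPENORCA_TEMPLATE):
        print(render_chat_prompt(MISTRAL_OPENORCA_TEMPLATE, messages))
```

If the template were later stored in the GGUF metadata as suggested, the same rendering path could consume it.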

pervrosen commented 10 months ago

@earonesty I think this is exactly what is being discussed in #809 (linked in a comment above). Perhaps you can weigh in on the choice of architectural pattern that seems to be under debate there.