ggerganov / llama.cpp

LLM inference in C/C++
MIT License

server : improvements and maintenance #4216

Open ggerganov opened 12 months ago

ggerganov commented 12 months ago

The server example has been growing in functionality and unfortunately I feel it is not very stable at the moment and there are some important features that are still missing. Creating this issue to keep track of some of these points and try to draw more attention from the community. I guess some of the tasks are relatively big and would require significant effort to complete.

This is likely not a complete list of things - if you think some feature is important to be improved or supported, drop a comment.

Have a look at issues labelled with server/webui.

IridiumMaster commented 12 months ago

Would love it if the server could get lookahead decoding and contrastive search. A collection of common presets would be very helpful for fast model evaluation. The ability to edit responses and replies in the UI would be very useful for rapidly testing prompt branches if combined with batching capabilities. Would also appreciate a simple implementation of request queuing and a server interface for the model training example. Edit: Discussion link for contrastive search: https://github.com/ggerganov/llama.cpp/discussions/3450 (other related topics / potential substitutes are mentioned in the thread).

ruped commented 12 months ago

Thanks for raising this issue and looking into the server example.

I think this #4201 could be relevant - although it sounds like the fix will be in the core code rather than in the server.

Since the addition of support for batching, llama.cpp could become a viable competitor to vllm for large scale deployments. This is also helpful for individual hobbyists who are using/building AI agents (because these may make multiple requests in parallel to the LLMs to construct answers). So I think your suggestions around improving stability/refactoring of the server example would be very valuable, as would focusing on throughput, particularly for batched requests (and benchmarking this against vllm).

mudler commented 12 months ago

It would also be lovely to see speculative sampling added to it - that would be a really great addition.

tobi commented 12 months ago

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

There are 100s of libraries and tools that integrate different subsets of backends and inference libraries, especially in the Python world. This doesn't make sense. We need a simple convention by which everything can interoperate. The solution is to use OpenAI's API as a protocol on localhost. Could there be better standards? Maybe. But this is the one we have, and it works really well.

My suggestion is that we clean up the server and treat it and the /chat/completions endpoint as the main deliverable of this repository. We can easily switch the web interface to use that as well. ./server -m ~/model should boot with the ideal default parameters read from the gguf, like context size and (if we can pull it off) chat template style.

This means that existing code only needs the api_url override to be modified to work locally.


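from openai import OpenAI

# The client library refuses to start without some API key (argument or
# OPENAI_API_KEY env var); the value below is just a placeholder.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
  model="llama!",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ]
)

# Print the text of the first (and only) returned choice
print(completion.choices[0].message.content)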

This works already, at least as long as you are loading a model that conforms to chatml and are ok with the default context size. I find that a much better vision for how LLM interop will work in the open source space. Different servers, different backends, all on the same proto.

FSSRepo commented 12 months ago

@ggerganov

Batched decoding endpoint?

This option to generate multiple alternatives for the same prompt requires the ability to change the seed, and the truth is, I've been having a bit of a struggle with it when adding parallel decoding, as it raises questions about how the seed should be managed.

spirobel commented 12 months ago

@tobi

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

LocalAI serves this usecase quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy to link library with a C API should be the main deliverable of this project.

studiotatsu commented 12 months ago

The OAI API included with the server is great, I love it. Please include the llama_params "repeat_penalty" and "min_p".

These params are much needed. Thanks.

antcodd commented 12 months ago

I think it would be good if the OAI endpoint supported the same set of parameters and defaults as the regular endpoint, with sensible or argument-driven defaults, given that many clients won't supply all parameters.

One issue is that the seed defaults to 0 instead of -1, so every regeneration is the same if the client doesn't specify a seed.
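
To illustrate the seed behaviour described above, here is a minimal sketch. It assumes the convention that -1 (or no seed at all) means "pick a fresh random seed", while any other value is used as given. The name `resolve_seed` is hypothetical and is not part of the server code.

```python
import secrets
from typing import Optional

RANDOM_SEED = -1  # sentinel assumed to mean "choose a fresh seed per request"


def resolve_seed(requested: Optional[int]) -> int:
    """Return the seed to use for one request.

    A missing seed or the -1 sentinel yields a fresh random 32-bit seed,
    so repeated generations differ; any other value is used as given.
    """
    if requested is None or requested == RANDOM_SEED:
        return secrets.randbits(32)
    return requested
```

With a default of 0 instead of -1, the second branch is always taken for clients that omit the seed, which is why every regeneration comes out identical.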

IridiumMaster commented 12 months ago

@tobi

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

LocalAI serves this usecase quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy to link library with a C API should be the main deliverable of this project.

With respect, I think the server endpoint is a different audience. LocalAI seems to be going for an "everything and the kitchen sink" approach. That's cool, and I respect the project, but what I would like from the server example is something different: Raw inference with the greatest number of capabilities at the fastest possible speed, along with tooling specifically designed to allow for large scale prompt testing of different model variants quickly and easily. This is what I would view as more of a "production" workflow as opposed to more of a hobbyist workflow. I agree with the upthread sentiment around making the server api a solid standard @tobi.

mudler commented 12 months ago

@tobi

Very excited about this! I think that the server should increasingly be thought of as the main deliverable of this repo.

LocalAI serves this usecase quite well already and has lots of traction. It is better to not compete with your customers and delight them instead. An easy to link library with a C API should be the main deliverable of this project.

With respect, I think the server endpoint is a different audience. LocalAI seems to be going for an "everything and the kitchen sink" approach. That's cool, and I respect the project, but what I would like from the server example is something different: Raw inference with the greatest number of capabilities at the fastest possible speed, along with tooling specifically designed to allow for large scale prompt testing of different model variants quickly and easily.

Sorry to jump in off-topic, but you are not sacrificing any speed or capabilities with LocalAI - in the end the engine is always the same (llama.cpp, or vllm, or you name it) - however I see the value of having a server in llama.cpp. In the end it's people's choice what best suits their needs. Also, the LocalAI server implementation is heavily based on that ;)

This is what I would view as more of a "production" workflow as opposed to more of a hobbyist workflow. I agree with the upthread sentiment around making the server api a solid standard @tobi.

For production there are quite a few issues that are, IMHO, bigger blockers than this. I've had several bugs in LocalAI with llama.cpp that still make it difficult to move in that direction, and I hope they get addressed with this ticket. Things like #3969 are quite scary for prod users.

ruped commented 12 months ago

Just a thought as a user of llama.cpp server: I imagine it's quite common for the llama.cpp server to be used by developers who are able to add non-core functionality in their own code (e.g. devs create their own application, library or REST server that wraps/orchestrates llama.cpp). Naturally the llama.cpp server is very convenient for this and works with any programming language. It also has a smaller, self-contained API to learn.

I think some of the following can be done in dev's own code outside of llama.cpp:

(Disclaimer: These are just examples, I haven't fully evaluated the pros/cons of implementing them outside of llama.cpp)

It's excellent if this project has the mission and bandwidth to provide functionalities like these. But if it sounds like it's becoming too much work or feature creep, then I imagine focusing on the bits that are impossible to do outside of llama.cpp is one of the ways to prioritise.

dongxiaolong commented 12 months ago

Hi @ggerganov. The vllm project has a PR under construction for a chat template that can be used as a reference: https://github.com/vllm-project/vllm/pull/1756

ggerganov commented 12 months ago

Regarding chat templates: I see they are using something called Jinja.

We are not going to implement Jinja in llama.cpp.

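The best we can do are 2 things:

* add an API endpoint for the clients to get the Jinja string and do whatever they want with it

* add hardcoded implementations of common templates, where we string match the template and if it is something we know, we call a simple C++ function to tokenize as expected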

Tostino commented 12 months ago

@ggerganov If you are going to hard code templates, this server will be totally unusable for a large number of users. I am experimenting with new templates, and would really rather the models trained with them be widely supported. Hell, there are so many variations of the chat-ml template floating around with no indication which is the correct version.

I mentioned on the other ticket that there is: https://github.com/jinja2cpp/Jinja2Cpp

Maybe that can be an optional component to add support for chat templates from the tokenizer, and hard coding can be the default code path. I understand not wanting to add additional dependencies.

Getting the jinja string in the client is not helpful as an API endpoint, unless there is a client side compatibility layer between the chat/completions and completions endpoint.

I had opened an issue for chat template support a while ago, when I started working on it for vLLM: https://github.com/ggerganov/llama.cpp/issues/3810

I implemented this for vLLM, and after going through a few rounds of testing, I had to rework things and add additional parameters and CLI arguments to support the API properly. We should very much stay on the same page for our implementations.

Here is the diff for my chat/completions endpoint changes: https://github.com/vllm-project/vllm/pull/1756/files#diff-38318677b76349044192bf70161371c88fb2818b85279d8fc7f2c041d83a9544

The important points from the vLLM pull request:

1. Addition of the `--chat-template` command-line argument to specify a chat template file or single-line template for the model.
2. Implementation of the `--response-role` command-line argument for defining the role name in chat responses when `add_generation_prompt` is set to true.
3. Update to the chat API request handling to support finishing a partial response correctly, and echoing input portions of messages (request.add_generation_prompt, and request.echo).

The request.echo is an extension of the api, due to the nature of Open Source LLMs being able to finish the last role:content pair in the messages list if request.add_generation_prompt=false (which is also an extension of the API due to the need to support this HF feature) and the template/model support that feature.

We should treat add_generation_prompt as default=true, because that is the behavior of the OpenAI API. This simply allows users to override that behavior if they need it, and gives them all the required knobs to use the feature properly.
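
To make the effect of `add_generation_prompt` concrete, here is a minimal sketch using ChatML as the example format. It is not the vLLM or HF implementation. In particular, leaving the last message open when the flag is false is an assumption of this sketch; a real chat template decides that for itself. The function name `render_chatml` is hypothetical.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in ChatML.

    With add_generation_prompt=True, every message is closed and an empty
    assistant turn is opened. With False, the last message is left open
    (no <|im_end|>) so the model continues it. Leaving the turn open is an
    assumption of this sketch; real templates decide this themselves.
    """
    parts = []
    last = len(messages) - 1
    for i, msg in enumerate(messages):
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}")
        # Close every message except a final one that should be continued.
        if add_generation_prompt or i != last:
            parts.append("<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

Calling it with the default gives the usual OpenAI-style behaviour (a fresh assistant reply); passing `add_generation_prompt=False` ends the prompt in the middle of the last message.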

mudler commented 12 months ago

Regarding chat templates: I see they are using something called Jinja.

We are not going to implement Jinja in llama.cpp.

The best we can do are 2 things:

* add an API endpoint for the clients to get the Jinja string and do whatever they want with it

* add hardcoded implementations of common templates, where we string match the template and if it is something we know, we call a simple C++ function to tokenize as expected

Just my personal thoughts, but C++ probably isn't the best language for that - templating is much easier to implement in scripting languages than in C++, and in my opinion it would undermine the maintainability and flexibility of a lean server.cpp implementation.

Just my 2c, but maybe templating fits better on top of llama-cpp-python - which might be easier to build and to maintain (while keeping the core small and extensible)?

ggerganov commented 12 months ago

@Tostino

All templates that I've seen so far are so basic that I don't understand why we need an entire scripting language to express them. Is there a more advanced use case other than a basic for loop over the messages + add prefix/suffix?

How many templates do we expect to ever have? 10s, 100s? Even if it is 1000s, I prefer to have them hardcoded instead of building jinja2cpp (it takes 10 minutes !! to just run cmake config)

Here is a sample ChatML template in a few lines of C++ that we currently use (and this is not even the best way to do it):

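// Needs <sstream>, <string> and <vector>; `json` (nlohmann::json) and the
// json_value() helper come from the server example.
std::string format_chatml(std::vector<json> messages)
{
    std::ostringstream chatml_msgs;

    // One "<|im_start|>role\ncontent<|im_end|>\n" block per message
    for (auto it = messages.begin(); it != messages.end(); ++it) {
        chatml_msgs << "<|im_start|>"
                    << json_value(*it, "role",    std::string("user")) << '\n';
        chatml_msgs << json_value(*it, "content", std::string(""))
                    << "<|im_end|>\n";
    }

    // Open the assistant turn so the model generates the reply
    chatml_msgs << "<|im_start|>assistant" << '\n';

    return chatml_msgs.str();
}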

I could be missing something, but for the moment I don't see a good reason to add Jinja support. Let's see how it goes - I'm open to reconsider, but need to see some reasonable examples and use cases that justify this dependency.


The request.echo is an extension of the api, due to the nature of Open Source LLMs being able to finish the last role:content pair in the messages list if request.add_generation_prompt=false (which is also an extension of the API due to the need to support this HF feature) and the template/model support that feature.

We should treat add_generation_prompt as default=true, because that is the behavior of the OpenAI API. This simply allows users to override that behavior if they need it, and gives them all the required knobs to use the feature properly.

I think I understand the request.add_generation_prompt parameter, but I don't understand request.echo - can you clarify / give an example?

@mudler

Yes, I agree.

Tostino commented 12 months ago

The fact is, if the rest of the ecosystem standardizes on these templates being "the way" to format messages, it will proliferate to new and unexpected use cases.

python3 -m vllm.entrypoints.openai.api_server --model teknium/OpenHermes-2.5-Mistral-7B --chat-template ./examples/template_inkbot.jinja

Here is an example call using my inkbot template which uses echo:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "teknium/OpenHermes-2.5-Mistral-7B",
    "stream": false,
    "stop": ["\n<#bot#>","\n<#user#>"],
    "add_generation_prompt": false,
    "echo": true,
    "temperature": 0.0,
    "n": 1,
    "messages": [
    {"role": "meta-current_date", "content": "2023-10-20"},
    {"role": "meta-task_name", "content": "general"},
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hello, how are you?"},
    {"role": "user", "content": "Great, thank you! Now i would like some help planning my weekend in Asheville, I would like some hike suggestions. Please list out a few moderate"}    
    ]
  }'

Which returns:

{"id":"cmpl-bb73e8eefb164c3194bb2b450369e1c6","object":"chat.completion","created":195778,"model":"teknium/OpenHermes-2.5-Mistral-7B","choices":[{"index":0,"message":{"role":"user","content":"Great, thank you! Now i would like some help planning my weekend in Asheville, I would like some hike suggestions. Please list out a few moderate to difficult hikes in the area."},"finish_reason":"stop"}],"usage":{"prompt_tokens":107,"total_tokens":121,"completion_tokens":14}}

vs with "echo": false:

{"id":"cmpl-86ba4dd235a84b8e9a7361b46b04ac79","object":"chat.completion","created":195723,"model":"teknium/OpenHermes-2.5-Mistral-7B","choices":[{"index":0,"message":{"role":"user","content":" to difficult hikes in the area."},"finish_reason":"stop"}],"usage":{"prompt_tokens":107,"total_tokens":121,"completion_tokens":14}}

Since the official OpenAI API for chat/completions doesn't allow you to complete an incomplete message, there was no point for them to implement echo in the chat/completions endpoint. The HF chat_template spec explicitly supports that feature with the add_generation_prompt parameter, so it made sense to implement echo for ease of use. It is an extension of the API, which is why I was calling it out though. I tried to choose the most likely behavior / keywords if OpenAI ever did expand their API to add echo.
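
The difference between the two responses above can be sketched as follows. This is only an illustration of the described behaviour, not the code from the vLLM pull request. The function name `reply_content` is hypothetical, and the sketch assumes `echo` has no effect when a fresh assistant turn is generated.

```python
def reply_content(partial: str, generated: str, add_generation_prompt: bool, echo: bool) -> str:
    """Content of the returned message.

    echo only matters when the model continued the last message
    (add_generation_prompt=False); then the caller's partial text is
    prepended to the newly generated text. That echo is ignored otherwise
    is an assumption of this sketch.
    """
    if echo and not add_generation_prompt:
        return partial + generated
    return generated
```

With `partial` set to the unfinished user message and `generated` set to " to difficult hikes in the area.", `echo=True` yields the full sentence and `echo=False` yields only the continuation.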

Edit: Yeah, 10 min for a cmake is painful... Unsure what the best way forward is to be honest. But without actual support for the chat template that the model creator defined, this isn't usable for me (and many others).

FSSRepo commented 12 months ago

In my opinion, most of these projects based on ggml have the characteristic of being very lightweight with few dependencies (headers library: httplib.h json.hpp stb_image.h and others), making them portable compared to having to download a 2 GB library like PyTorch and the entire Python environment that downloads packages that will never be used.

Adding overly heavy dependencies, especially those dependent on an external language like Python, seems to go against the idea of this type of project.

Tostino commented 12 months ago

In my opinion, most of these projects based on ggml have the characteristic of being very lightweight with few dependencies (headers library: httplib.h json.hpp stb_image.h and others), making them portable compared to having to download a 2 GB library like PyTorch and the entire Python environment that downloads packages that will never be used.

Adding overly heavy dependencies, especially those dependent on an external language like Python, seems to go against the idea of this type of project.

Absolutely no one is advocating for a whole pytorch dependency chain. There just may be other options for running the jinja that don't bloat the dependency chain too badly, and I very much think it's worth discussing further to see if there is an acceptable solution that can be found.

Even if it's something like transpiling jinja to another language that we can directly run, or providing hooks for users to run a python interpreter and the jinja dependency to give the results back to the main cpp program. That way it can be optional, and fall back to hard coded options if unavailable.

Just some thoughts, take them for what you will, I am not a cpp dev.

FSSRepo commented 12 months ago

I would suggest something like creating a small utility that performs the functionality we are interested in using C++ (porting it).

Analyzing the Jinja2cpp library quickly, it has Boost as a dependency, which explains the long CMake configuration time. It could be beneficial to decouple that library and include only the necessary functions for Jinja2cpp to work, making it more lightweight.

psugihara commented 12 months ago

@tobi completely agree that server.cpp should be a first-class focus of this repo. My macOS app uses exactly the architecture you describe, hitting server on localhost. I would note however that iOS apps cannot include executables so server.cpp won't work in at least that case. Tangentially, it might make sense to pull some of the common completion/tokenizing/batching/parallelization functionality being added to server.cpp into the llama.cpp core so that each platform doesn't have to rewrite completion_loop, etc.


I also wanted to throw in an example of some ugly code I'd love to kill with built-in server.cpp templating. I'm guessing every server.cpp client has some version of this and I'm sure they all have slightly different bugs: https://github.com/psugihara/FreeChat/blob/main/mac/FreeChat/Models/NPC/PromptTemplates/Templates.swift

@Tostino After understanding more of the background here, I agree that ideally we'd want to support the jinja templates included in GGUFs. I didn't even know these were added to GGUF, that's so cool! Unfortunately I'm not seeing a ton of existing work in cpp besides the relatively heavyweight jinja2cpp you found as well. Implementing a minimal jinja2 parser seems out of scope for v1 of template support but perhaps a more incremental compromise could work...

1) add an endpoint for retrieving the jinja template, allowing clients to skip parsing the gguf themselves if they want to run the template directly
2) this endpoint could indicate whether the template is supported by server.cpp itself (server.cpp could hardcode a cpp template implementation + hash of a corresponding jinja template for example)
3) when requesting a chat completion, the client could indicate whether they've already templated their input

I agree with @ggerganov that the templates are pretty trivial to implement in c++ or whatever and I'd first and foremost just like to have them all in one place (ideally llama.cpp) rather than bespoke implementations in each client. A mapping from jinja template hashes to c++ functions would be the most performant setup too, even if it's a bit ugly conceptually.
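
The hash-to-function mapping could look roughly like this. It is written in Python for brevity rather than C++, and every name in it (`KNOWN_TEMPLATES`, `find_formatter`, the placeholder template string) is hypothetical. A real table would be keyed by the hash of the exact Jinja string stored in the model's metadata.

```python
import hashlib
from typing import Callable, Dict, List, Optional

Messages = List[Dict[str, str]]
Formatter = Callable[[Messages], str]


def format_chatml(messages: Messages) -> str:
    """Hardcoded ChatML formatter, mirroring the C++ example in this thread."""
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m.get('role', 'user')}\n{m.get('content', '')}<|im_end|>\n")
    out.append("<|im_start|>assistant\n")
    return "".join(out)


def template_hash(jinja_source: str) -> str:
    """Stable identifier for a template: SHA-256 of its exact source text."""
    return hashlib.sha256(jinja_source.encode("utf-8")).hexdigest()


# Placeholder: stands in for the exact Jinja string found in a model file.
CHATML_JINJA = "<placeholder for the exact ChatML Jinja source>"

# Known template hash -> hardcoded formatter.
KNOWN_TEMPLATES: Dict[str, Formatter] = {template_hash(CHATML_JINJA): format_chatml}


def find_formatter(jinja_source: str) -> Optional[Formatter]:
    """Return the hardcoded formatter for a known template, else None."""
    return KNOWN_TEMPLATES.get(template_hash(jinja_source))
```

A `None` result is what the proposed endpoint would report as "template not supported by the server", leaving the client to run the Jinja template itself. Note that exact hashing means any whitespace difference in the template counts as a different template.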

If templates are added here, I can delete my implementation in FreeChat so we'll have net 0 fragmentation :)

Tostino commented 12 months ago

when requesting a chat completion, the client could indicate whether they've already templated their input

That isn't possible. You can template your response on the client side, but then you need to hit the legacy completion endpoint, because the payload for chat/completion doesn't support a formatted string, just a list of messages with role/content.

psugihara commented 12 months ago

then you need to hit the legacy completion endpoint

For my use-case that would be fine. Though it does look like there are some non-standard args supported by server's chat/completion already (e.g. mirostat).

wizzard0 commented 12 months ago

May I add my 2c?

I'd very much prefer to keep OAI -> llama and tokens -> tokens parts separate and convert on the "proxy". IMHO Jinja is a terrible middle ground which is both complicated and not flexible enough. See below for examples.

What would be useful on the server.cpp side is more APIs useful for the "OAI -> llama" converter service:

I can't imagine trying to cram all the hacks I've found useful on the "OAI -> llama" side into the C++ binary without it devolving into an unmaintainable mess.

Some examples:

  1. Models are very finicky re chat templates. Often adding/removing whitespace, BOS/EOS etc, using a different template, repeating the system prompt etc improves the output drastically (tested with temperature 0 and fixed seed ofc)
  2. so I often find myself generating multiple outputs with different templates/samplers (!) and returning a single result later
  3. or re-prompting to return a single long reply on small-context models
  4. server-side templating is a black box, and watching the raw formatted prompt is very very useful. E.g. if you're trying to debug why the grammar doesn't match anything
  5. Another hack I've found useful is to combine chat templating with a custom prefix for the last response if the reply without that prefix is bad, and converting the results with custom code
  6. ...in particular, the grammar + pre/post-conversion enables quite OK function calling. But it's prompt-dependent as well, so even the full Jinja won't help, not to mention debugging.

cztomsik commented 12 months ago

you need to hit the legacy completion endpoint

I am doing exactly that in my project, everything is "client-side" because I can then easily "complete" messages (the model refuses to answer, so I edit the message to start with "Sure, here's how to" and let the model fill-in the rest)
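
A minimal sketch of that client-side approach, assuming a ChatML model. The helper names are hypothetical, and the request fields (`prompt`, `n_predict`) are what the server's raw completion endpoint is assumed to accept, so check the server README before relying on them.

```python
import json
from typing import Dict, List


def build_prefilled_prompt(messages: List[Dict[str, str]], assistant_prefix: str) -> str:
    """Format messages as ChatML and start the assistant turn with a prefix."""
    prompt = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    # The assistant turn is left open so the model continues after the prefix.
    return prompt + "<|im_start|>assistant\n" + assistant_prefix


def completion_payload(prompt: str, n_predict: int = 256) -> str:
    """JSON body for a raw completion request (field names are assumptions)."""
    return json.dumps({"prompt": prompt, "n_predict": n_predict})
```

The client then prepends the same prefix to whatever text comes back, since the server only returns the continuation.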

And I know nobody asked, but adding Jinja to cpp project is a terrible idea

cztomsik commented 12 months ago

BTW: here's a mini-mustache I've implemented for my tool (including a var parser which can be used for auto-showing input fields). Here's a video of how it can be used for non-chat purposes: https://twitter.com/cztomsik/status/1722741486641393676

/**
 * Simple mustache-like template engine:
 * - no loops, partials, lambdas, etc.
 * - just variables, sections and inverted sections
 * (c) 2023 Kamil Tomšík, MIT License
 */
export const template = (tpl: string, data: Record<string, any>) => {
  let depth = 0
  let stack = [{ key: "", inverted: false, inner: "" }]

  for (const [_, prev, op, key] of tpl.matchAll(/([\s\S]*?)(?:{{\s*(#|\^|\/)(.*?)\s*}}|$)/g)) {
    stack[depth].inner += prev.replace(/{{\s*(.*?)\s*}}/g, (_, k) => data[k] ?? "")

    if (op === "/") {
      if (key != stack[depth].key) {
        throw new Error()
      }

      const section = stack.pop()!
      stack[--depth].inner += (section.inverted ? !data[key] : data[key]) ? section.inner : ""
      continue
    }

    if (op) {
      stack[++depth] = { key, inverted: op === "^", inner: "" }
    }
  }

  if (depth !== 0 || stack.length > 1) {
    throw new Error()
  }

  return stack[depth].inner
}

/**
 * Extracts all variables from a template
 */
export const parseVars = (tpl: string) => [...new Set(Array.from(tpl.matchAll(/{{\s*(?:#|\^|\/)?(.*?)\s*}}/g), m => m[1]))]

// tests
import test from "node:test"
import assert from "node:assert"
import { template, parseVars } from "./template"

test("template", () => {
  assert.strictEqual(template("{{foo}}", { foo: "bar" }), "bar")
  assert.strictEqual(template("{{foo}}", { foo: 0 }), "0")
  assert.strictEqual(template("{{foo}}", { foo: null }), "")
  assert.strictEqual(template("{{foo}}", { foo: undefined }), "")
  assert.strictEqual(template("{{ foo }}", { foo: "bar" }), "bar")

  assert.strictEqual(template("{{#foo}}bar{{/foo}}", { foo: true }), "bar")
  assert.strictEqual(template("{{#foo}}bar{{/foo}}", { foo: false }), "")

  assert.strictEqual(template("{{^foo}}bar{{/foo}}", { foo: true }), "")
  assert.strictEqual(template("{{^foo}}bar{{/foo}}", { foo: false }), "bar")

  assert.strictEqual(template("{{#a}}foo{{#b}}bar{{/b}}{{/a}}", { a: true, b: true }), "foobar")
  assert.strictEqual(template("{{#a}}foo{{#b}}bar{{/b}}{{/a}}", { a: true, b: false }), "foo")
  assert.strictEqual(template("{{#a}}foo{{#b}}bar{{/b}}{{/a}}", { a: false, b: true }), "")

  const tpl = `
    Hello {{value}}!
    {{#cond}} {{value}}{{/cond}}{{^cond}} Fallback{{/cond}}
  `

  assert.strictEqual(
    template(tpl, { value: "World", cond: true }).trim(),
    `Hello World!
     World`.trim()
  )

  assert.strictEqual(
    template(tpl, { value: "World", cond: false }).trim(),
    `Hello World!
     Fallback`.trim()
  )
})

test("parseVars", () => {
  assert.deepStrictEqual(parseVars("foo"), [])
  assert.deepStrictEqual(parseVars("{{foo}}"), ["foo"])
  assert.deepStrictEqual(parseVars("{{foo}} {{bar}}"), ["foo", "bar"])
  assert.deepStrictEqual(parseVars("{{ foo }}"), ["foo"])
  assert.deepStrictEqual(parseVars("{{#a}}foo{{#b}}bar{{/b}}{{/a}}"), ["a", "b"])
  assert.deepStrictEqual(parseVars("{{^a}}foo{{^b}}bar{{/b}}{{/a}}"), ["a", "b"])
})

Feel free to use it (either as is or as a base)

ruped commented 12 months ago

How many templates do we expect to ever have? 10s, 100s? Even if it is 1000s, I prefer to have them hardcoded instead of building jinja2cpp (it takes 10 minutes !! to just run cmake config)

Every single LoRA can have its own template. E.g. at our company we fine-tune our own models and have unique parameterized prompt templates (e.g. instruction = Summarise the context., context = In the beginning..., reference = Book, Page 1 to 7, response-style = polite-informal). LLM chat models can be extremely sensitive (even one misplaced whitespace character in the template can degrade them) and often models don't follow the conventions from other models perfectly - so the user needs to be able to observe the generated prompt for debugging.
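
As an illustration of such a parameterized template, here is a minimal sketch using only the Python standard library. The section headings and layout are made up for the example; a real fine-tune would have to reuse the exact format of its training data.

```python
from string import Template

# Hypothetical layout; a real fine-tune must reuse its exact training format.
PROMPT_TEMPLATE = Template(
    "### Instruction:\n$instruction\n"
    "### Context:\n$context\n"
    "### Reference:\n$reference\n"
    "### Style:\n$response_style\n"
    "### Response:\n"
)


def render_prompt(**fields: str) -> str:
    """Fill the template; raises KeyError if a field is missing."""
    return PROMPT_TEMPLATE.substitute(**fields)


prompt = render_prompt(
    instruction="Summarise the context.",
    context="In the beginning...",
    reference="Book, Page 1 to 7",
    response_style="polite-informal",
)
# repr() makes stray or missing whitespace visible when debugging.
print(repr(prompt))
```

Printing the `repr` of the final prompt is the kind of observability meant above: a missing newline or extra space shows up immediately.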

Right now I prefer doing all of the templating in our codebase outside of llama.cpp (using the same code that created the training data) and keeping llama.cpp server oblivious about our template. I guess templating in llama.cpp would be optional, so this would still be possible if templating is implemented.

teleprint-me commented 11 months ago

@ggerganov

I'd like to share some thoughts on the templating discussion for llama.cpp. In my view, adding extensive templating support that shifts responsibility away from users could lead to significant maintenance challenges. It's essential to consider the scalable management of such features.

One key aspect to remember is that OpenAI manages a single template, ChatML. In contrast, the open-source community could potentially generate a vast array of templates, many of which might be unexplored or undefined at present. It's a situation reminiscent of Hilbert's paradox of the Grand Hotel, where accommodating 'just one more guest' leads to an infinitely expanding task.

Therefore, I believe llama.cpp should focus on supporting basic templates inherent to the base llama models and empower users to implement their own specific templates. This approach provides a balance, maintaining simplicity within llama.cpp while offering flexibility for user customization.

Regarding fine-tuning with custom templates, I can foresee additional complexities. If a user fine-tunes a model with a unique template, integrating that into llama.cpp could become cumbersome. In essence, we risk creating a bottleneck and potentially restricting the adaptability and control users have over their models.

In terms of implementation, I suggest leaning towards simplicity. Utilizing RFC8259 for JSON object serialization is one viable path. It offers a streamlined approach by serializing JSON Objects into and from strings following a defined grammar. This method aligns with the principle of keeping things straightforward and manageable.

Regarding Jinja2, while I've found it to be a powerful tool in web development, its complexity seems unnecessary for llama.cpp. Jinja2 is designed for HTML templating in web applications, which is quite a different context from what we're dealing with here.

Lastly, the implementation of tokenizer.chat_template in GGUF, which can store templates, appears to be a practical solution already in place. It allows users to handle template management upstream, aligning with the notion of keeping llama.cpp more focused and less burdened by the complexities of internal template management.

In conclusion, my advocacy is for a system that prioritizes ease of use, flexibility, and user empowerment. By keeping llama.cpp streamlined and delegating more specialized functions like templating to the users, we can maintain a balance between functionality and simplicity.

@ggerganov, @wizzard0, @FSSRepo, @mudler, your insights have been invaluable in shaping this perspective.

Full disclosure: Yes, I used GPT-4 to help me clarify my position. It helped filter out my biases while focusing on the essential points in my argument. All it did was revise my original comment. I made any necessary modifications afterwards.

tom-adsfund commented 11 months ago

More options and stability around prompt (and generation) caching would be very useful. For example, maximum cache size, and periodic (n-token) caching.

tobi commented 11 months ago

I think the template problem can be solved quite elegantly by adding an additional step to the convert.py: parse the chat template with its Python library, walk the resulting AST in python, and emit it as a mini vm opcodes or s-exp that can easily be implemented on the cpp side. We can then store those templates alongside the Python chat_template metadata.

The amount of needed instructions is pretty small. You can pull the top 100 tokenizer configs from huggingface and get every possible template that we will likely ever see. With those you can basically code gen unit tests to ensure that Python and cpp produce the same string. Quite a fun project!

teleprint-me commented 11 months ago

@tobi

The conversion scripts haven't implemented it yet, but when the model is converted, if the template is available, then it's embedded into the model. All you would need to do is extract it from the model. There's no need to do anything else.

Example here.

All you'd need to do is extract it from the chat template and then add it to LLM_KV_NAMES. Exporting the template is already added to GGUF.

Tostino commented 11 months ago

@tobi I like the sound of that approach personally, it seems like something like that would fit a whole lot better with this project than a dependency to get jinja working directly.

You can pull the top 100 tokenizer configs from huggingface and get every possible template that we will likely ever see.

I promise this is not going to be true though. There will be tons of private models created by companies that will never be released, but still need an environment to run properly. That may not be the primary purpose of this project, but it is a valid use case.

@teleprint-me What is stored in the gguf file is the jinja template though, and the issue is that cpp has no good way of executing jinja directly. @tobi suggested something that would solve how to execute an exactly equivalent converted version of the template.

teleprint-me commented 11 months ago

@Tostino

The chat template string is stored in the model, I'm assuming. It hasn't been defined in the source code yet. Someone who understands the exact plans behind this can chime in. I've been piecing it together over the last few days on my own.

Maybe there's a way to make everyone happy. There's always room for compromise. Personally, I'm not a fan of using Jinja2. It just so happens that's what Hugging Face used and there's been a high demand for it.

I've been advocating against Jinja2. I think embedding the template into the model is a good idea regardless. We just need a clear idea of the structure to embed into it.

cztomsik commented 11 months ago

parse the chat template with its Python library, walk the resulting AST in python, and emit it as a mini vm opcodes or s-exp that can easily be implemented on the cpp side

you also need a compatible environment; for example, this wouldn't work without re-implementing some Python semantics:

{{ 1 in [1, 2, 3] }}

But maybe we don't need this?

tobi commented 11 months ago

the llama2 chat model is sort of a worst case scenario.

chat_template = """{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ bos_token + '[INST] ' + message['content'] + ' [/INST]' }}
    {% elif message['role'] == 'system' %}
        {{ '<<SYS>>\\n' + message['content'] + '\\n<</SYS>>\\n\\n' }}
    {% elif message['role'] == 'assistant' %}
        {{ ' '  + message['content'] + ' ' + eos_token }}
    {% endif %}
{% endfor %}"""

from jinja2.sandbox import ImmutableSandboxedEnvironment
jinja_env = ImmutableSandboxedEnvironment(trim_blocks=True, lstrip_blocks=True)
parsed_content = jinja_env.parse(chat_template)

if you print the ast here you get:

For: For(target=Name(name='message', ctx='store'), iter=Name(name='messages', ctx='load'), body=[If(test=Compare(expr=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='role'), ctx='load'), ops=[Operand(op='eq', expr=Const(value='user'))]), body=[Output(nodes=[TemplateData(data='        '), Add(left=Add(left=Add(left=Name(name='bos_token', ctx='load'), right=Const(value='[INST] ')), right=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='content'), ctx='load')), right=Const(value=' [/INST]')), TemplateData(data='\n')])], elif_=[If(test=Compare(expr=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='role'), ctx='load'), ops=[Operand(op='eq', expr=Const(value='system'))]), body=[Output(nodes=[TemplateData(data='        '), Add(left=Add(left=Const(value='<<SYS>>\n'), right=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='content'), ctx='load')), right=Const(value='\n<</SYS>>\n\n')), TemplateData(data='\n')])], elif_=[], else_=[]), If(test=Compare(expr=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='role'), ctx='load'), ops=[Operand(op='eq', expr=Const(value='assistant'))]), body=[Output(nodes=[TemplateData(data='        '), Add(left=Add(left=Add(left=Const(value=' '), right=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='content'), ctx='load')), right=Const(value=' ')), right=Name(name='eos_token', ctx='load')), TemplateData(data='\n')])], elif_=[], else_=[])], else_=[])], else_=[], test=None, recursive=False))
Name: Name(name='message', ctx='store'))
Name: Name(name='messages', ctx='load'))
If: If(test=Compare(expr=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='role'), ctx='load'), ops=[Operand(op='eq', expr=Const(value='user'))]), body=[Output(nodes=[TemplateData(data='        '), Add(left=Add(left=Add(left=Name(name='bos_token', ctx='load'), right=Const(value='[INST] ')), right=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='content'), ctx='load')), right=Const(value=' [/INST]')), TemplateData(data='\n')])], elif_=[If(test=Compare(expr=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='role'), ctx='load'), ops=[Operand(op='eq', expr=Const(value='system'))]), body=[Output(nodes=[TemplateData(data='        '), Add(left=Add(left=Const(value='<<SYS>>\n'), right=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='content'), ctx='load')), right=Const(value='\n<</SYS>>\n\n')), TemplateData(data='\n')])], elif_=[], else_=[]), If(test=Compare(expr=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='role'), ctx='load'), ops=[Operand(op='eq', expr=Const(value='assistant'))]), body=[Output(nodes=[TemplateData(data='        '), Add(left=Add(left=Add(left=Const(value=' '), right=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='content'), ctx='load')), right=Const(value=' ')), right=Name(name='eos_token', ctx='load')), TemplateData(data='\n')])], elif_=[], else_=[])], else_=[]))
Compare: Compare(expr=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='role'), ctx='load'), ops=[Operand(op='eq', expr=Const(value='user'))]))
Getitem: Getitem(node=Name(name='message', ctx='load'), arg=Const(value='role'), ctx='load'))
Name: Name(name='message', ctx='load'))
Const: Const(value='role'))
Operand: Operand(op='eq', expr=Const(value='user')))
Const: Const(value='user'))
Output: Output(nodes=[TemplateData(data='        '), Add(left=Add(left=Add(left=Name(name='bos_token', ctx='load'), right=Const(value='[INST] ')), right=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='content'), ctx='load')), right=Const(value=' [/INST]')), TemplateData(data='\n')]))
TemplateData: TemplateData(data='        '))
Add: Add(left=Add(left=Add(left=Name(name='bos_token', ctx='load'), right=Const(value='[INST] ')), right=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='content'), ctx='load')), right=Const(value=' [/INST]')))
Add: Add(left=Add(left=Name(name='bos_token', ctx='load'), right=Const(value='[INST] ')), right=Getitem(node=Name(name='message', ctx='load'), arg=Const(value='content'), ctx='load')))
Add: Add(left=Name(name='bos_token', ctx='load'), right=Const(value='[INST] ')))
Name: Name(name='bos_token', ctx='load'))
Const: Const(value='[INST] '))
Getitem: Getitem(node=Name(name='message', ctx='load'), arg=Const(value='content'), ctx='load'))
Name: Name(name='message', ctx='load'))
Const: Const(value='content'))
Const: Const(value=' [/INST]'))
TemplateData: TemplateData(data='\n'))

as messy as this is, i do feel like this could be turned into a trimmer s-expression with just a few operators and stored alongside the raw template in the .gguf. the cpp side would be straightforward.
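To make the "few operators" idea concrete, here is a minimal sketch of what such a trimmed s-expression and its evaluator could look like. Python stands in for the cpp side here. The operator names (`for`, `if`, `eq`, `get`, `cat`, `var`) and the tuple encoding are assumptions made up for illustration, not a format that llama.cpp or GGUF defines, and the expression is written by hand rather than generated from the Jinja AST. It also ignores the whitespace-only TemplateData nodes visible in the dump above.

```python
# Hand-written s-expression for the llama2 template; operator names are hypothetical.
MSG = ("var", "message")
ROLE = ("get", MSG, "role")
CONTENT = ("get", MSG, "content")

LLAMA2_SEXP = (
    "for", "message", ("var", "messages"),
    ("if", ("eq", ROLE, "user"),
        ("cat", ("var", "bos_token"), "[INST] ", CONTENT, " [/INST]"),
        ("if", ("eq", ROLE, "system"),
            ("cat", "<<SYS>>\n", CONTENT, "\n<</SYS>>\n\n"),
            ("if", ("eq", ROLE, "assistant"),
                ("cat", " ", CONTENT, " ", ("var", "eos_token")),
                ""))),
)


def evaluate(expr, env):
    """Evaluate one expression. Plain strings are literals, tuples are operations."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    if op == "var":   # look up a variable by name
        return env[args[0]]
    if op == "get":   # index into a mapping with a literal key
        return evaluate(args[0], env)[args[1]]
    if op == "eq":    # equality test
        return evaluate(args[0], env) == evaluate(args[1], env)
    if op == "cat":   # string concatenation
        return "".join(evaluate(a, env) for a in args)
    if op == "if":    # (if cond then else)
        cond, then, otherwise = args
        return evaluate(then if evaluate(cond, env) else otherwise, env)
    if op == "for":   # (for name iterable body), concatenating each iteration's output
        name, iterable, body = args
        return "".join(evaluate(body, {**env, name: item})
                       for item in evaluate(iterable, env))
    raise ValueError(f"unknown operator: {op}")


def render(messages, bos_token, eos_token):
    """Render a chat with the hand-written llama2 expression."""
    env = {"messages": messages, "bos_token": bos_token, "eos_token": eos_token}
    return evaluate(LLAMA2_SEXP, env)
```

Six operators cover this template. Constructs such as `{{ 1 in [1, 2, 3] }}` would each need one more operator with Python-compatible semantics, which is where the compatibility concern raised earlier comes in.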

b-mc2 commented 11 months ago

A more cosmetic but server-related note is the "poor man's markdown" in server/public/index.html; in particular, lines 862-865 break some Python code. They convert opening and closing `__` and `**` to <strong></strong>, and opening and closing `_` and `*` to <em></em>.

This breaks Python code like `__init__()`, the exponent operator `**`, and snake_case variables. Of course it affects other languages too.

A future fix could incorporate more robust markdown rendering and syntax highlighting that takes into account the `language` tag that usually precedes code blocks in LLM output.
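The failure mode is easy to reproduce outside the browser. The sketch below is a Python re-creation of this kind of emphasis replacement, not the actual JavaScript from index.html; the regular expressions and function names are assumptions for illustration. It also shows one possible mitigation: split out code spans first and only format the prose between them.

```python
import re


def naive_markdown(text):
    """Apply bold/italic replacements everywhere, the way a simplistic renderer might."""
    text = re.sub(r"\*\*(.*?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"__(.*?)__", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.*?)\*", r"<em>\1</em>", text)
    text = re.sub(r"_(.*?)_", r"<em>\1</em>", text)
    return text


FENCE = "`" * 3  # built this way to avoid writing a literal fence inside this example
# Matches a fenced block first, otherwise a single-line inline code span.
CODE_SPAN = re.compile("(" + FENCE + ".*?" + FENCE + "|`[^`\n]*`)", re.DOTALL)


def code_aware_markdown(text):
    """Leave fenced blocks and inline code untouched; format only the prose between them."""
    parts = CODE_SPAN.split(text)
    # With one capturing group, re.split places the code spans at odd indices.
    return "".join(part if i % 2 else naive_markdown(part)
                   for i, part in enumerate(parts))
```

With the naive version, `if __name__ == "__main__":` loses its underscores and gains `<strong>` tags. The code-aware version only helps when the model actually wraps its code in backticks, so it is a mitigation rather than a full markdown parser.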

zakkor commented 11 months ago

Piggybacking off this issue with a question/feature request:

Is it possible to interactively tell the server to stop generating?

For example, a long response is currently being generated, user clicks "stop", so the generation should end.

tom-adsfund commented 11 months ago

@zakkor I'd suggest that if giving a HTTP streaming reply, then it should automatically stop the generation on the close of the connection.

shibe2 commented 11 months ago

if giving a HTTP streaming reply

I think, it would be useful in non-streaming mode too. For example, if it takes too long.

it should automatically stop the generation on the close of the connection.

I remember that when streaming, it stops generation of new tokens, but it does not stop prompt processing.

zakkor commented 11 months ago

I'd suggest that if giving a HTTP streaming reply, then it should automatically stop the generation on the close of the connection.

Oh yeah, in fact it does already work like that, awesome!

tom-adsfund commented 11 months ago

Oh yeah, in fact it does already work like that, awesome!

Great! Should probably be documented then.
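For documentation purposes, the behaviour discussed above can be pictured as a loop that polls the connection before each unit of work. This is a simulation under stated assumptions, not the server's actual code: `run_request`, `FakeConnection` and the `client_connected` callback are hypothetical names, and the "work" is a stand-in. Polling between prompt batches as well as between tokens is what would cover the prompt-processing case mentioned earlier.

```python
from typing import Callable, List, Tuple


def run_request(prompt_tokens: List[int], n_predict: int,
                client_connected: Callable[[], bool],
                batch_size: int = 4) -> Tuple[str, List[str]]:
    """Simulate one completion request that can be abandoned at any step."""
    # Prompt processing: poll between batches so a long prompt can be abandoned too.
    for start in range(0, len(prompt_tokens), batch_size):
        if not client_connected():
            return "cancelled_in_prompt", []
        _batch = prompt_tokens[start:start + batch_size]  # stand-in for evaluating one batch

    # Generation: poll before sampling each new token.
    generated: List[str] = []
    for i in range(n_predict):
        if not client_connected():
            return "cancelled_in_generation", generated
        generated.append(f"tok{i}")  # stand-in for sampling one token
    return "completed", generated


class FakeConnection:
    """Test double that reports 'open' for a fixed number of polls, then 'closed'."""

    def __init__(self, polls_before_close: int) -> None:
        self.remaining = polls_before_close

    def is_open(self) -> bool:
        self.remaining -= 1
        return self.remaining >= 0
```

The same polling approach would apply in non-streaming mode, as long as the server can detect that the client has gone away before the response is written.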

xiaoyunwu commented 11 months ago

Just curious, is it possible to support multiple loras with the same base model?

kalomaze commented 11 months ago

Just curious, is it possible to support multiple loras with the same base model?

Technically you can mix and merge multiple loras as you please when merging. Idk about the difficulty of doing so at inference time.
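As a rough illustration of merging, the usual LoRA formulation adds a scaled low-rank product to each base weight matrix, so several adapters can simply be summed into the same base: W' = W + sum_i s_i * (B_i A_i). The sketch below does this with plain Python lists. `merge_loras` is a hypothetical helper, not a llama.cpp function, and it shows offline merging only, not switching adapters per request at inference time.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_loras(base, adapters):
    """Return base + sum of scale * (B @ A) over all adapters.

    base:     d_out x d_in matrix
    adapters: list of (scale, B, A) with B of shape d_out x r and A of shape r x d_in
    """
    merged = [row[:] for row in base]  # copy so the base weights stay untouched
    for scale, b, a in adapters:
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged
```

Because the base is copied rather than modified, the same base weights can be reused to build several differently merged variants.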

kalomaze commented 11 months ago

https://github.com/ggerganov/llama.cpp/pull/4367

Is there interest in setting a standard for sampling hyperparameters in general? I'm of the personal opinion that:

kalomaze commented 11 months ago

I have discovered that server.cpp sampling seems to be forcibly resizing candidates to 1 no matter what. https://github.com/ggerganov/llama.cpp/issues/4370

tezlm commented 11 months ago

Implementing /v1/completions and /v1/embeddings seems like fairly low hanging fruit
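For reference, the response envelope that OpenAI-style clients expect from /v1/embeddings is small. The helper below is a sketch of that JSON shape only; `embeddings_response` is a hypothetical name, and how the server would compute the vectors and token counts is left out.

```python
import json


def embeddings_response(model, vectors, prompt_tokens):
    """Wrap raw embedding vectors in an OpenAI-style /v1/embeddings response body."""
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i, "embedding": list(vec)}
            for i, vec in enumerate(vectors)
        ],
        "model": model,
        "usage": {"prompt_tokens": prompt_tokens, "total_tokens": prompt_tokens},
    }


if __name__ == "__main__":
    body = embeddings_response("example-model", [[0.1, 0.2], [0.3, 0.4]], prompt_tokens=7)
    print(json.dumps(body, indent=2))
```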

tom-adsfund commented 11 months ago

For caching, you could allow the query to also send a "cache_ttl" (ms) key-value pair to say how long a prompt/generation should be cached for.
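A sketch of how a per-request TTL could combine with the maximum cache size suggested earlier: entries carry an expiry time, and the least recently used entry is evicted once the cache is full. Everything here is an assumption for illustration: `PromptCache` is not an existing llama.cpp class, "cache_ttl" is only the key proposed in this comment, and the cached value is a stand-in for whatever state the server would actually keep.

```python
import time
from collections import OrderedDict


class PromptCache:
    """Size-bounded cache with a per-entry TTL and least-recently-used eviction."""

    def __init__(self, max_entries, clock=time.monotonic):
        self.max_entries = max_entries
        self.clock = clock              # injectable so expiry can be tested
        self._entries = OrderedDict()   # prompt -> (expires_at, value)

    def put(self, prompt, value, cache_ttl_ms):
        expires_at = self.clock() + cache_ttl_ms / 1000.0
        self._entries[prompt] = (expires_at, value)
        self._entries.move_to_end(prompt)          # mark as most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)      # evict the least recently used

    def get(self, prompt):
        entry = self._entries.get(prompt)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:             # expired: drop and report a miss
            del self._entries[prompt]
            return None
        self._entries.move_to_end(prompt)
        return value
```

Expired entries are removed lazily on lookup. A real server would probably bound the cache by memory rather than by entry count, since cached prompts differ a lot in size.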

pwdonald commented 11 months ago

Is it a known issue that the frontend (at least for mixtral output) seems to eat underscores in code blocks? If I look in the JavaScript console I will see if __name__ == "__main__" for example but in the server llama.cpp frontend it displays as if name == "main". It seems like this is an area for improvement.

shibe2 commented 11 months ago

@pwdonald It is known to eat asterisks, as described in #3723. I changed the issue title to cover underscores too.

stolsvik commented 10 months ago

In 'llamafile', a suggestion to create an OpenAI-compatible /v1/embeddings endpoint has been voiced: https://github.com/Mozilla-Ocho/llamafile/issues/171