ggerganov / llama.cpp

LLM inference in C/C++
MIT License

llama : create llamax library #5215

Open ggerganov opened 5 months ago

ggerganov commented 5 months ago

Depends on: https://github.com/ggerganov/llama.cpp/issues/5214

The llamax library will wrap llama and expose common high-level functionality. The main goal is to ease the integration of llama.cpp into 3rd party projects. Ideally, most projects would interface through the llamax API for all common use cases, while still having the option to use the low-level llama API for less common applications that require finer control of the state.

A simple way to think about llamax is that it will simplify all of the existing examples in llama.cpp by hiding the low-level stuff, such as managing the KV cache and batching requests.

Roughly, llamax will require its own state object and a run-loop function.

The specifics of the API are yet to be determined - suggestions are welcome.

ngxson commented 5 months ago

I recently did the same thing in a personal project, so I'd like to share the list of functions that I decided to keep in my implementation:

function          purpose
load              load the model, taking into account the specified cparams, mparams and sparams
lookup_token      look up special tokens, like [INST] or <<SYS>>
tokenize          convert a string into a list of tokens
eval              take a list of tokens, then create a batch and evaluate it via llama_decode
decode_logits     (admittedly bad naming) return the token sampled from the logits
session_save      save the session to a file
session_load      load a session from a file
sampling_accept   accept one token into the sampling state
exit              free everything, then exit

More info on my project: my implementation is basically a web server that takes JSON as input, for example: { "action": "load", "model_path": "...", ... }. That's why I never return (and cannot return) any pointers. It's more like a "low-level" API that is accessible over the web.
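
For illustration only, that function set could map onto a minimal C-style interface along the following lines (every name and signature is invented for this sketch; it is not existing code nor a proposed llamax API):

// Hypothetical wrapper interface mirroring the table above; names are illustrative only.
#include <cstdint>
#include <string>
#include <vector>

struct wrapper_state; // opaque handle owning the model, context and sampling state

wrapper_state *      wrapper_load(const std::string & model_path /*, cparams, mparams, sparams */);
int32_t              wrapper_lookup_token(wrapper_state * st, const std::string & text);    // e.g. "[INST]", "<<SYS>>"
std::vector<int32_t> wrapper_tokenize(wrapper_state * st, const std::string & text);
void                 wrapper_eval(wrapper_state * st, const std::vector<int32_t> & tokens); // builds a batch, calls llama_decode
int32_t              wrapper_decode_logits(wrapper_state * st);                             // samples the next token from the logits
bool                 wrapper_session_save(wrapper_state * st, const std::string & path);
bool                 wrapper_session_load(wrapper_state * st, const std::string & path);
void                 wrapper_sampling_accept(wrapper_state * st, int32_t token);            // update the sampling state
void                 wrapper_exit(wrapper_state * st);                                      // frees everything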

wsxiaoys commented 5 months ago

It would be great if we could generalize grammar-based sampling into a callback-based approach. This would allow downstream use cases to adjust the sampling logic in arbitrary ways. (In Tabby's case, we would really like to integrate a tree-sitter grammar for a similar goal.)

Something like void (*apply_logits_override)(float * logits, void * data) should do the job.
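
As a rough sketch of that idea (nothing below exists in llama.cpp; the llamax_* names, the my_grammar_* helpers and all signatures are invented for illustration), such a hook could be registered on the context and invoked once per sampling step, right before the token is picked:

// Hypothetical logits-override hook; all names are illustrative only.
#include <cmath> // INFINITY

typedef void (*llamax_logits_override_fn)(float * logits, int n_vocab, void * user_data);

struct llamax_context; // opaque, hypothetical

// Hypothetical registration function: the callback runs before each sampling step.
void llamax_set_logits_override(llamax_context * ctx, llamax_logits_override_fn fn, void * user_data);

// Application-side state, e.g. a tree-sitter based validity checker (stubbed here).
struct my_grammar_state { /* ... */ };
static bool my_grammar_allows(const my_grammar_state &, int /*token*/) { return true; }

// Example callback: mask out every token the grammar currently rejects.
static void my_override(float * logits, int n_vocab, void * user_data) {
    const auto * grammar = static_cast<const my_grammar_state *>(user_data);
    for (int i = 0; i < n_vocab; ++i) {
        if (!my_grammar_allows(*grammar, i)) {
            logits[i] = -INFINITY; // this token can no longer be sampled
        }
    }
}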

ggerganov commented 5 months ago

Ok, thanks for the suggestions - these are useful.

I'm thinking the API should also support the multi-sequence batched use case, where the user can dynamically insert new requests for processing (something like the current slots in the server example, but better). In that sense, calls such as eval and decode_logits won't be very suitable. Something more like:

llamax_context * ctx = llamax_context_init(...);

// thread 0 or 1
ctx->add_request(...);
ctx->add_request(...);
...

// main thread
while (true) {
    ctx->process(...);
}

llamax_context_free(ctx);

ngxson commented 5 months ago

Yeah, I haven't yet considered multi-sequence support in my implementation.

As a first step, I designed my API to be readable from top to bottom, something like:

load(...)
eos_token = lookup_token("</s>")
input_tokens = tokenize("Hello, my name is")
eval(input_tokens)
while True:
  next_token, next_piece = decode_logits()
  if next_token == eos_token:
    break
  print(next_piece)
  sampling_accept(next_token)
  eval([next_token])
exit()

With multi-sequence, it may become:

...
input_tokens = tokenize("Hello, my name is")
seq_id = new_seq() # added
eval(input_tokens, seq_id) # add seq_id
while True:
  next_token, next_piece = decode_logits()
  if next_token == eos_token:
    break
  print(next_piece)
  sampling_accept(next_token, seq_id) # add seq_id
  eval([next_token], seq_id) # add seq_id
delete_seq(seq_id)
exit()

It would be nice if llamax could be thread-safe, so that, for example, the code above could be run for different sequences from different threads.

@wsxiaoys I'm not quite sure whether modifying logits is suitable for a high-level API, but maybe llamax can just expose the underlying llama_context, so that you can use the low-level API to interact with the low-level context.

AshD commented 4 months ago

I think this is a great idea.

Currently, I am using llama.cpp with LlamaSharp, but it does not work with the latest version of llama.cpp because of changes in llama.cpp. Ideally, I would like to drop the latest llama.dll directly into my .NET project.

It would be great to have a Super High-level API that does NOT have breaking changes, something like a subset of the llama-cpp-python high-level API: https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#high-level-api

A Super High-level API contract that does not change from version to version; I think this should cover 90% of use cases.

There can then be a second level of API (tokenize, etc.) and a low-level API.
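
Purely as an illustration of what such a stable contract might look like (every name below is hypothetical, loosely mirroring create_chat_completion from llama-cpp-python; this is not an existing or proposed llama.cpp API):

// Hypothetical "super high-level" contract; all names are invented for illustration.
#include <string>
#include <vector>

struct hl_message {
    std::string role;    // "system", "user", "assistant"
    std::string content;
};

struct hl_params {
    float temperature = 0.8f;
    int   max_tokens  = 512;
};

struct hl_model; // opaque handle

hl_model *  hl_load(const std::string & model_path);
// Takes the full message history and returns the assistant reply; the implementation
// is free to reuse cached KV state for the prefix it has already seen.
std::string hl_chat_completion(hl_model * model,
                               const std::vector<hl_message> & messages,
                               const hl_params & params = {});
void        hl_free(hl_model * model);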

ngxson commented 4 months ago

@AshD yeah, right. It seems like create_chat_completion in llama-cpp-python is the sweet spot between using llama.cpp as a "server" and as a "library".

The chat_format param used by create_chat_completion can replace my lookup_token. The reason I have lookup_token in my implementation is that not all models use the ChatML format.

However, create_chat_completion creates a tricky situation when I also want to cache the prompt. Prompt caching can be useful when you have a very long system prompt.

AshD commented 4 months ago

@ngxson I was thinking of prompt caching. In our app, Fusion Quill, calls to llama.cpp are either chat-type calls or one-off calls for things like summarization.

For the chat use case, the messages list is [system, usermsg1] for the first call, then [system, usermsg1, assistant1, usermsg2, ...]. Maybe llama.cpp can cache the tokens for the messages it has already seen.

For the other use case, caching the tokens for the system message will make sense.

This way, the Super High-level API is kept simple.

ngxson commented 4 months ago

@AshD yeah, I actually have a bonus idea, but haven't had time to implement it:

In the chat API, some clients may remove the oldest messages in order to fit the history into the context window. On the server side, we can detect this change and then use llama_kv_cache_seq_rm and llama_kv_cache_seq_shift to shift the KV cache instead of recalculating it.

This kind of behavior already exists in main, but it is done at the token level instead of the "message" level.

My idea is to detect the change and calculate the number of KV cache positions to shift just by comparing the list of messages from the last request with the one from the new request. This is just plain logic code and has nothing to do with inference, though.
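
For illustration, that "plain logic" part could look roughly like the sketch below (the message type and helper names are invented; the actual cache operations would then be done with llama_kv_cache_seq_rm / llama_kv_cache_seq_shift, whose exact calls are omitted here on purpose):

// Hypothetical sketch: given the previous and the new message list, find how many
// leading messages were dropped and how many tokens that represents, so the caller
// knows how far to shift the KV cache.
#include <cstddef>
#include <string>
#include <vector>

struct chat_message {
    std::string role;
    std::string content;
    int n_tokens; // tokens this message occupied in the previous request
};

// Returns how many tokens were dropped from the front of the history (i.e. how far the
// KV cache has to be shifted), or -1 if the new history is not simply the old history
// with some of the oldest messages removed (in that case, recompute from scratch).
// A real implementation would likely pin the system message and only drop messages after it.
static int count_tokens_to_shift(const std::vector<chat_message> & prev,
                                 const std::vector<chat_message> & next) {
    for (size_t dropped = 0; dropped <= prev.size(); ++dropped) {
        const size_t remaining = prev.size() - dropped;
        if (remaining > next.size()) {
            continue; // more old messages remain than the new history holds; try dropping more
        }
        bool match = true;
        for (size_t i = 0; i < remaining; ++i) {
            if (prev[dropped + i].role    != next[i].role ||
                prev[dropped + i].content != next[i].content) {
                match = false;
                break;
            }
        }
        if (!match) {
            continue;
        }
        // Sum the tokens of the dropped leading messages: that is the shift amount.
        int n_tokens = 0;
        for (size_t i = 0; i < dropped; ++i) {
            n_tokens += prev[i].n_tokens;
        }
        return n_tokens;
    }
    return -1;
}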

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 30 days with no activity.

cebtenzzre commented 3 months ago

Not stale.

AshD commented 3 months ago

I hope someone picks this up soon. Our app, Fusion Quill, uses llama.cpp via LlamaSharp. There is a big time lag between llama.cpp supporting a new model and LlamaSharp supporting the new version of llama.cpp.

With a stable high-level API, this problem should go away, and it would simplify downstream llama.cpp libraries.

amakropoulos commented 3 months ago

Thanks a lot for this awesome project! Just to +1 that this feature would be tremendously helpful!