# llama.cpp
This is a wrapper of llama.cpp, implemented as per the discussion *Integration of llama.cpp and whisper.cpp*:
- Use the llama.cpp C interface in `llama.h`
- Reimplement the `common` library
As mentioned in the discussion, the (maybe distant) future plan is to ditch llama.cpp entirely, reimplementing it with vanilla ggml and a C++ interface.
## Reimplementation Notes
- Better error handling, please
- GGUF metadata access (`llama_model_meta_*`) is not great. We should provide a better interface (a friendlier accessor is sketched below)
- `llama_chat_apply_template` does not handle memory allocation optimally. There's a lot of room for improvement (see the allocation sketch below)
- As a whole, chat management is not very efficient. `llama_chat_format_single` doing a full chat format for a single message is terrible (an incremental alternative is sketched below)
- Chat templates can't be used to escape special tokens. If the user actually enters some, this just messes up the resulting formatted text.
- Give vocab more visibility
- Token-to-text can be handled much more elegantly by using plain ol' `string_view` instead of copying strings. It's not like tokens are going to be modified once the model is loaded.
- If we don't reimplement, perhaps keeping a parallel array mapping all tokens to strings would be a good idea (sketched below)
- `llama_batch` being used for both input and output makes it hard to propagate the constness of the input buffer. This leads to code having to use non-const buffers even if we know they're not going to be modified. We should bind the buffer constness to the batch struct itself (see the sketch below).
- The low-level llama context currently takes an RNG seed (which is only used for mirostat sampling). A reimplemented context should be deterministic. If an operation requires random numbers, a generator should be provided from the outside (see the sketch below).
  - For now we will hide mirostat sampling altogether and ditch the seed
- As per this discussion, we should decide how we want to deal with asset storage and whether we want to abstract the I/O away.
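
On the GGUF metadata point: a minimal sketch of a friendlier accessor, assuming `llama_model_meta_val_str`'s snprintf-like contract (it returns the full value length, or a negative value if the key is missing, truncating when the buffer is too small). `get_meta` is a hypothetical helper, not part of `llama.h`:

```cpp
#include <llama.h>
#include <optional>
#include <string>

// Hypothetical helper: fetch a GGUF metadata value as an optional string
// instead of juggling raw buffers and return codes at every call site.
std::optional<std::string> get_meta(const llama_model* model, const char* key) {
    // first call only queries the required length (snprintf-style)
    const int32_t len = llama_model_meta_val_str(model, key, nullptr, 0);
    if (len < 0) {
        return std::nullopt; // key not present
    }
    std::string value(size_t(len) + 1, '\0'); // room for the terminating NUL
    llama_model_meta_val_str(model, key, value.data(), value.size());
    value.pop_back(); // drop the NUL written by the C API
    return value;
}

// usage: auto arch = get_meta(model, "general.architecture");
```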
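On `llama_chat_apply_template`: the C API forces a guess-and-retry dance on callers, formatting the whole chat again whenever the first buffer was too small. A sketch of the pattern, assuming the signature from the `llama.h` revision this wrapper targets (the upstream signature has changed over time); `apply_template` is our hypothetical wrapper name:

```cpp
#include <llama.h>
#include <string>
#include <vector>

// Sketch of the allocation dance the C API imposes on every caller.
std::string apply_template(const llama_model* model,
                           const std::vector<llama_chat_message>& chat,
                           bool add_assistant_prompt) {
    std::vector<char> buf(1024); // initial guess
    // nullptr template = use the one from the model's GGUF metadata
    int32_t n = llama_chat_apply_template(model, nullptr,
                                          chat.data(), chat.size(),
                                          add_assistant_prompt,
                                          buf.data(), int32_t(buf.size()));
    if (n < 0) {
        return {}; // template missing or unsupported
    }
    if (n > int32_t(buf.size())) {
        buf.resize(size_t(n));
        // too small: the entire chat gets formatted a second time
        n = llama_chat_apply_template(model, nullptr,
                                      chat.data(), chat.size(),
                                      add_assistant_prompt,
                                      buf.data(), int32_t(buf.size()));
    }
    return std::string(buf.data(), size_t(n));
}
```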
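And a sketch of the incremental direction a reimplementation could take, reusing the hypothetical `apply_template` helper (and includes) from the previous sketch: cache the length of the transcript formatted so far, so appending a message costs one full format instead of the two that a `llama_chat_format_single`-style helper pays. Assistant-prompt handling is deliberately elided to keep the sketch short:

```cpp
// Hypothetical incremental formatter. Caching the formatted length assumes
// the template produces a stable prefix as messages are appended, which
// holds for typical chat templates when no assistant prompt is injected.
struct chat_session {
    std::vector<llama_chat_message> messages;
    size_t formatted_len = 0; // portion of the transcript already emitted

    // returns only the newly produced text for the appended message
    std::string append(const llama_model* model, llama_chat_message msg) {
        messages.push_back(msg);
        const std::string full = apply_template(model, messages,
                                                /*add_assistant_prompt=*/false);
        std::string suffix = full.substr(formatted_len);
        formatted_len = full.size();
        return suffix;
    }
};
```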
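On token-to-text: a sketch of the parallel-array idea. Every token is converted once at load time and handed out as `std::string_view` afterwards, with no further copies. Note that `llama_token_to_piece`'s signature has varied across llama.cpp revisions; the form below is the one with `lstrip`/`special` parameters:

```cpp
#include <llama.h>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical vocab cache: tokens don't change after the model is loaded,
// so the pieces can be materialized once and served as views.
class vocab_cache {
    std::vector<std::string> m_pieces;
public:
    explicit vocab_cache(const llama_model* model) {
        const int32_t n = llama_n_vocab(model);
        m_pieces.reserve(size_t(n));
        for (llama_token t = 0; t < n; ++t) {
            char buf[256]; // generous for a single piece
            const int32_t len = llama_token_to_piece(model, t, buf, sizeof(buf),
                                                     /*lstrip=*/0, /*special=*/true);
            m_pieces.emplace_back(buf, len > 0 ? size_t(len) : 0);
        }
    }
    std::string_view piece(llama_token t) const {
        return m_pieces[size_t(t)];
    }
};
```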
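On the `llama_batch` constness point, a design sketch (names illustrative, not a proposed final API): split the batch into a read-only input view and a separate output binding, so the compiler carries the guarantee instead of the caller casting it away:

```cpp
#include <cstdint>

using llama_token_t = int32_t; // mirrors the llama_token typedef in llama.h

// Read-only view over the caller's tokens: decode can never write through
// it, so callers holding const buffers need no const_cast.
struct token_input_view {
    const llama_token_t* tokens   = nullptr;
    const int32_t*       pos      = nullptr;
    int32_t              n_tokens = 0;
};

// Outputs live in their own struct, so mutability is explicit and separate.
struct decode_output {
    float* logits = nullptr; // filled in by decode
};

int decode(token_input_view input, decode_output& out) {
    // ... decoding would go here; the type system now guarantees that
    // `input.tokens` is left untouched
    (void)input; (void)out;
    return 0;
}
```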
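On determinism: a sketch of passing the generator in from the outside instead of seeding the context. `sample_index` is a hypothetical operation, not an existing llama.cpp function:

```cpp
#include <random>

// Any op that needs randomness receives a generator from the caller;
// the context itself holds no RNG state and stays deterministic.
template <typename Rng>
int sample_index(const float* probs, int n, Rng& rng) {
    // probs: non-negative weights, e.g. post-softmax probabilities
    std::discrete_distribution<int> dist(probs, probs + n);
    return dist(rng);
}

// usage: the caller owns and seeds the generator
// std::mt19937 rng(42);
// int token = sample_index(probs.data(), int(probs.size()), rng);
```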