Install the llama.cpp server executable using make, cmake, or whatever method works. There doesn't seem to be a make install target, so I just did:
mkdir ~/.local/bin/llamacpp
cp server ~/.local/bin/llamacpp
You can then run it against a downloaded GGUF model, e.g.
~/.local/bin/llamacpp/server -m ~/.local/share/models/TheBloke_OpenHermes-2.5-Mistral-7B-16k-GGUF/openhermes-2.5-mistral-7b-16k.Q5_K_M.gguf --host 0.0.0.0 --port 8000 -c 4096 --log-format text --path ~/.local/share/llamacpp/
That runs with a 4K context, listening globally on port 8000 (the default host is loopback-only, 127.0.0.1). The llama.cpp server README details the command-line options. A basic completion request:
curl --request POST \
    --url http://localhost:8000/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128}'
A chat example. Notice the use of min_p, which I don't think is possible via the OpenAI API.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [
        {"role": "system",
         "content": "You are an AI assistant. Your top priority is achieving user fulfillment via helping the user with their requests."},
        {"role": "user",
         "content": "Write a limerick about python exceptions"}
    ], "min_p": 0.05}'
Informally, it feels a lot faster than running the same curl requests against locally hosted llama-cpp-python (OpenAI API).
Examples of API usage:
import asyncio; from ogbujipt.llm_wrapper import prompt_to_chat, llama_cpp_http_chat
llm_api = llama_cpp_http_chat('http://localhost:8000')
resp = asyncio.run(llm_api(prompt_to_chat('Knock knock!'), min_p=0.05))
llm_api.first_choice_message(resp)
Non-chat:
import asyncio; from ogbujipt.llm_wrapper import llama_cpp_http
llm_api = llama_cpp_http('http://localhost:8000')
resp = asyncio.run(llm_api('Knock knock!', min_p=0.05))
resp['content']
Implemented llama.cpp-style API keys. Examples:
import os, asyncio; from ogbujipt.llm_wrapper import prompt_to_chat, llama_cpp_http_chat
LLAMA_CPP_APIKEY=os.environ.get('LLAMA_CPP_APIKEY')
llm_api = llama_cpp_http_chat('http://localhost:8000', apikey=LLAMA_CPP_APIKEY)
resp = asyncio.run(llm_api(prompt_to_chat('Knock knock!'), min_p=0.05))
llm_api.first_choice_message(resp)
Non-chat:
import os, asyncio; from ogbujipt.llm_wrapper import llama_cpp_http
LLAMA_CPP_APIKEY=os.environ.get('LLAMA_CPP_APIKEY')
llm_api = llama_cpp_http('http://localhost:8000', apikey=LLAMA_CPP_APIKEY)
resp = asyncio.run(llm_api('Knock knock!', min_p=0.05))
resp['content']
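For the curious, on the wire the llama.cpp-style API key is just a bearer token. A sketch of the equivalent raw request, again assuming httpx (the server has to have been started with a key for this to matter):
import os
import httpx

# llama.cpp-style API keys travel as a standard Authorization: Bearer header
headers = {'Authorization': f"Bearer {os.environ.get('LLAMA_CPP_APIKEY', '')}"}
resp = httpx.post(
    'http://localhost:8000/completion',
    json={'prompt': 'Knock knock!', 'n_predict': 64},
    headers=headers,
    timeout=60.0,
)
print(resp.json()['content'])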
This work has made it clear to me that the __call__ methods of all the ogbujipt.llm_wrapper classes should always have been async. Just ripping the bandage off now and making that change 😬
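For code that's already running in an async context, that just means awaiting the call directly instead of wrapping it in asyncio.run. A quick sketch with the chat wrapper from above:
import asyncio
from ogbujipt.llm_wrapper import prompt_to_chat, llama_cpp_http_chat

llm_api = llama_cpp_http_chat('http://localhost:8000')

async def main():
    # __call__ is a coroutine, so just await it from async code
    resp = await llm_api(prompt_to_chat('Knock knock!'), min_p=0.05)
    print(llm_api.first_choice_message(resp))

asyncio.run(main())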
Right now we have 2 flavors of LLM client class: ogbujipt.llm_wrapper.openai_api, which wraps the OpenAI API, and ogbujipt.llm_wrapper.ctransformer, which wraps ctransformers for local, in-process hosting. Add another which wraps the direct-HTTP llama.cpp API. Targeted for the 0.8.0 release.