🚧 ialacol is being rewritten from Python to Rust/WebAssembly; see details in https://github.com/chenhunghan/ialacol/pull/93
ialacol (pronounced "localai") is a lightweight drop-in replacement for the OpenAI API.
It is an OpenAI API-compatible wrapper around ctransformers, supporting GGML/GPTQ models with optional CUDA/Metal acceleration, and it works with all LLMs supported by ctransformers.
ialacol is inspired by other similar projects like LocalAI, privateGPT, local.ai, llama-cpp-python, closedai, and mlc-llm, with a specific focus on Kubernetes deployment.
See Recipes below for deployment instructions.
ialacol does not have a UI; however, it is compatible with any web UI that supports the OpenAI API, for example chat-ui after PR #541 was merged.
Assuming ialacol is running at port 8000, you can configure chat-ui to use zephyr-7b-beta.Q4_K_M.gguf served by ialacol:
MODELS=`[
{
"name": "zephyr-7b-beta.Q4_K_M.gguf",
"displayName": "Zephyr 7B β",
"preprompt": "<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate.</s>\n",
"userMessageToken": "<|user|>\n",
"userMessageEndToken": "</s>\n",
"assistantMessageToken": "<|assistant|>\n",
"assistantMessageEndToken": "\n",
"parameters": {
"temperature": 0.1,
"top_p": 0.95,
"repetition_penalty": 1.2,
"top_k": 50,
"max_new_tokens": 4096,
"truncate": 999999
},
"endpoints" : [{
"type": "openai",
"baseURL": "http://localhost:8000/v1",
"completion": "chat_completions"
}]
}
]`
Or, to use openchat_3.5.Q4_K_M.gguf served by ialacol:
MODELS=`[
{
"name": "openchat_3.5.Q4_K_M.gguf",
"displayName": "OpenChat 3.5",
"preprompt": "",
"userMessageToken": "GPT4 User: ",
"userMessageEndToken": "<|end_of_turn|>",
"assistantMessageToken": "GPT4 Assistant: ",
"assistantMessageEndToken": "<|end_of_turn|>",
"parameters": {
"temperature": 0.1,
"top_p": 0.95,
"repetition_penalty": 1.2,
"top_k": 50,
"max_new_tokens": 4096,
"truncate": 999999,
"stop": ["<|end_of_turn|>"]
},
"endpoints" : [{
"type": "openai",
"baseURL": "http://localhost:8000/v1",
"completion": "chat_completions"
}]
}
]`
ialacol offers first-class support for Kubernetes: compared to running it standalone, everything can be automated and configured.
To quickly get started with ialacol on Kubernetes, follow the steps below:
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama-2-7b-chat ialacol/ialacol
By default, it deploys Meta's Llama 2 Chat model quantized by TheBloke.
Port-forward
kubectl port-forward svc/llama-2-7b-chat 8000:8000
Chat with the default model llama-2-7b-chat.ggmlv3.q4_0.bin using curl:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "How are you?"}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false}' \
http://localhost:8000/v1/chat/completions
Alternatively, use OpenAI's client library (see more examples in the examples/openai folder):
openai -k "sk-fake" \
-b http://localhost:8000/v1 -vvvvv \
api chat_completions.create -m llama-2-7b-chat.ggmlv3.q4_0.bin \
-g user "Hello world!"
All configuration is done via environment variables.
Parameter | Description | Default | Example |
---|---|---|---|
DEFAULT_MODEL_HG_REPO_ID | The Hugging Face repo id to download the model from | None | TheBloke/orca_mini_3B-GGML |
DEFAULT_MODEL_HG_REPO_REVISION | The Hugging Face repo revision | main | gptq-4bit-32g-actorder_True |
DEFAULT_MODEL_FILE | The file name to download from the repo, optional for GPTQ models | None | orca-mini-3b.ggmlv3.q4_0.bin |
MODE_TYPE | Model type to override the automatic model type detection | None | gptq, gpt_bigcode, llama, mpt, replit, falcon, gpt_neox, gptj |
LOGGING_LEVEL | Logging level | INFO | DEBUG |
TOP_K | top-k for sampling | 40 | Integers |
TOP_P | top-p for sampling | 1.0 | Floats |
REPETITION_PENALTY | Repetition penalty for sampling | 1.1 | Floats |
LAST_N_TOKENS | The last n tokens for repetition penalty | 1.1 | Integers |
SEED | The seed for sampling | -1 | Integers |
BATCH_SIZE | The batch size for evaluating tokens, GGUF/GGML models only | 8 | Integers |
THREADS | Number of threads, overriding the auto-detected value (CPU count / 2); set to 1 for GPTQ models | Auto | Integers |
MAX_TOKENS | The maximum number of tokens to generate | 512 | Integers |
STOP | The token that stops the generation | None | <|endoftext|> |
CONTEXT_LENGTH | Override the auto-detected context length | 512 | Integers |
GPU_LAYERS | The number of layers to offload to the GPU | 0 | Integers |
TRUNCATE_PROMPT_LENGTH | Truncate the prompt if set | 0 | Integers |
Sampling parameters, including TOP_K, TOP_P, REPETITION_PENALTY, LAST_N_TOKENS, SEED, MAX_TOKENS and STOP, can be overridden per request via the request body, for example:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
http://localhost:8000/v1/chat/completions
will use temperature=2, top_p=1.0 and top_k=0 for this request.
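The same per-request overrides can be sent from the openai Python client; standard fields such as temperature and top_p are regular arguments, while non-standard ones such as top_k can be passed through extra_body (a sketch assuming openai 1.x):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-fake")

response = client.chat.completions.create(
    model="llama-2-7b-chat.ggmlv3.q4_0.bin",
    messages=[{"role": "user", "content": "Tell me a story."}],
    temperature=2.0,
    top_p=1.0,
    # top_k is not part of the standard OpenAI schema, so pass it in the raw body.
    extra_body={"top_k": 0},
)
print(response.choices[0].message.content)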
There is an image hosted on ghcr.io (with CUDA11, CUDA12, Metal, and GPTQ variants available).
docker run --rm -it -p 8000:8000 \
-e DEFAULT_MODEL_HG_REPO_ID="TheBloke/Llama-2-7B-Chat-GGML" \
-e DEFAULT_MODEL_FILE="llama-2-7b-chat.ggmlv3.q4_0.bin" \
ghcr.io/chenhunghan/ialacol:latest
For developers/contributors
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML" DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin" LOGGING_LEVEL="DEBUG" THREAD=4 uvicorn main:app --reload --host 0.0.0.0 --port 9999
Build image
docker build --file ./Dockerfile -t ialacol .
Run container
export DEFAULT_MODEL_HG_REPO_ID="TheBloke/orca_mini_3B-GGML"
export DEFAULT_MODEL_FILE="orca-mini-3b.ggmlv3.q4_0.bin"
docker run --rm -it -p 8000:8000 \
-e DEFAULT_MODEL_HG_REPO_ID=$DEFAULT_MODEL_HG_REPO_ID \
-e DEFAULT_MODEL_FILE=$DEFAULT_MODEL_FILE ialacol
To enable GPU/CUDA acceleration, you need to use the container image built for GPU and set the GPU_LAYERS environment variable. GPU_LAYERS is determined by the size of your GPU memory; see the PR/discussion in llama.cpp to find the best value.
For CUDA 11, set deployment.image = ghcr.io/chenhunghan/ialacol-cuda11:latest; deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.
For CUDA 12, set deployment.image = ghcr.io/chenhunghan/ialacol-cuda12:latest; deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.
Only the llama, falcon, mpt and gpt_bigcode (StarCoder/StarChat) model types support CUDA.
helm install llama2-7b-chat-cuda12 ialacol/ialacol -f examples/values/llama2-7b-chat-cuda12.yaml
Deploys the Llama 2 7B chat model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.
helm install starcoderplus-guanaco-cuda12 ialacol/ialacol -f examples/values/starcoderplus-guanaco-cuda12.yaml
Deploys the Starcoderplus-Guanaco-GPT4-15B-V1.0 model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.
If you see CUDA driver version is insufficient for CUDA runtime version when making a request, you are likely using an Nvidia driver that is not compatible with the CUDA version.
Upgrade the driver manually on the node (see here if you are using CUDA11 + AMI), or try a different version of CUDA.
To enable Metal support, use the ialacol-metal image built for Metal:
deployment.image = ghcr.io/chenhunghan/ialacol-metal:latest
For example
helm install llama2-7b-chat-metal ialacol/ialacol -f examples/values/llama2-7b-chat-metal.yaml
To use GPTQ, you must set:
deployment.image = ghcr.io/chenhunghan/ialacol-gptq:latest
deployment.env.MODEL_TYPE = gptq
For example
helm install llama2-7b-chat-gptq ialacol/ialacol -f examples/values/llama2-7b-chat-gptq.yaml
kubectl port-forward svc/llama2-7b-chat-gptq 8000:8000
openai -k "sk-fake" -b http://localhost:8000/v1 -vvvvv api chat_completions.create -m gptq_model-4bit-128g.safetensors -g user "Hello world!"
ialacol can serve as the backend for the GitHub Copilot client, since Copilot's API is almost identical to the OpenAI completion API.
However, a few things need to be kept in mind:
The Copilot client sends a lengthy prompt that includes all the related context for code completion (see copilot-explorer), which puts a heavy load on the server. If you are trying to run ialacol locally, opt in to the TRUNCATE_PROMPT_LENGTH environment variable to truncate the prompt from the beginning and reduce the workload.
Copilot sends requests in parallel; to increase throughput, you probably need a queue such as text-inference-batcher.
Start two instances of ialacol:
gh repo clone chenhunghan/ialacol && cd ialacol && python3 -m venv .venv && source .venv/bin/activate && python3 -m pip install -r requirements.txt
LOGGING_LEVEL="DEBUG"
THREAD=2
DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML"
DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin"
TRUNCATE_PROMPT_LENGTH=100 # optional
uvicorn main:app --host 0.0.0.0 --port 9998
uvicorn main:app --host 0.0.0.0 --port 9999
Start tib, pointing to upstream ialacol instances.
gh repo clone ialacol/text-inference-batcher && cd text-inference-batcher && npm install
UPSTREAMS="http://localhost:9998,http://localhost:9999" npm start
Configure VSCode GitHub Copilot to use tib:
"github.copilot.advanced": {
"debug.overrideEngine": "stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin",
"debug.testOverrideProxyUrl": "http://localhost:8000",
"debug.overrideProxyUrl": "http://localhost:8000"
}
LLMs are known to be sensitive to sampling parameters: a higher temperature leads to more "randomness", hence the LLM becomes more "creative"; top_p and top_k also contribute to the "randomness".
If you want to make the LLM more creative:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
http://localhost:8000/v1/chat/completions
If you want the LLM to be more consistent and to generate the same result for the same input:
curl -X POST \
-H 'Content-Type: application/json' \
-d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "0.1", "top_p": "0.1", "top_k": "40" }' \
http://localhost:8000/v1/chat/completions
Other supported features include:
the starcoder model type via ctransformers
GET /models and POST /completions
POST /embeddings, backed by Hugging Face Apache-2.0 embedding models such as Sentence Transformers and hkunlp/instructor
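For example, the models and embeddings endpoints can be exercised with the openai Python client (a sketch assuming openai 1.x; the embedding model name below is a placeholder for whatever embedding model the server is configured to serve):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-fake")

# List the models the server currently exposes (GET /models).
for model in client.models.list():
    print(model.id)

# Request an embedding (POST /embeddings); the model name is illustrative.
embedding = client.embeddings.create(
    model="sentence-transformers/all-MiniLM-L6-v2",
    input="Kubernetes is a container orchestration platform.",
)
print(len(embedding.data[0].embedding))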
Deploy Meta's Llama 2 Chat model quantized by TheBloke.
7B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-7b-chat ialacol/ialacol -f examples/values/llama2-7b-chat.yaml
13B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-13b-chat ialacol/ialacol -f examples/values/llama2-13b-chat.yaml
70B Chat
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-70b-chat ialacol/ialacol -f examples/values/llama2-70b-chat.yaml
Deploy OpenLLaMA 7B model quantized by rustformers.
ℹ️ This is a base model, likely only useful for text completion.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-7b ialacol/ialacol -f examples/values/openllama-7b.yaml
Deploy OpenLLaMA 13B Open Instruct model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-13b-instruct ialacol/ialacol -f examples/values/openllama-13b-instruct.yaml
Deploy MosaicML's MPT-7B model quantized by rustformers. ℹ️ This is a base model, likely only useful for text completion.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-7b ialacol/ialacol -f examples/values/mpt-7b.yaml
Deploy MosaicML's MPT-30B Chat model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-30b-chat ialacol/ialacol -f examples/values/mpt-30b-chat.yaml
Deploy Uncensored Falcon 7B model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-7b ialacol/ialacol -f examples/values/falcon-7b.yaml
Deploy Uncensored Falcon 40B model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-40b ialacol/ialacol -f examples/values/falcon-40b.yaml
Deploy the starchat-beta model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install starchat-beta ialacol/ialacol -f examples/values/starchat-beta.yaml
Deploy the WizardCoder model quantized by TheBloke.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install wizard-coder-15b ialacol/ialacol -f examples/values/wizard-coder-15b.yaml
Deploy the lightweight pythia-70m model, which has only 70 million parameters (~40MB), quantized by rustformers.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install pythia70m ialacol/ialacol -f examples/values/pythia-70m.yaml
Deploy the RedPajama 3B model.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install redpajama-3b ialacol/ialacol -f examples/values/redpajama-3b.yaml
Deploy the StableLM 7B model.
helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install stablelm-7b ialacol/ialacol -f examples/values/stablelm-7b.yaml
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
pip freeze > requirements.txt