llama-cpp-server

Docker containers for llama-cpp-python, which provides an OpenAI API compatible server on top of llama.cpp. The motivation is to have prebuilt containers for use in Kubernetes.

Ideally, llama-cpp-python itself would automate publishing containers and support automated model fetching from URLs.

Images

Image                                               Description
ghcr.io/allenporter/llama-cpp-server-simple         No special hardware acceleration
ghcr.io/allenporter/llama-cpp-server-openblas       Uses OpenBLAS for CPU-only computation
ghcr.io/allenporter/llama-cpp-server-clblast        Uses OpenCL (CLBlast) for hardware acceleration (works with Intel and other GPUs)
ghcr.io/allenporter/llama-cpp-server-cuda           Uses CUDA; preferred for NVIDIA GPUs
ghcr.io/allenporter/llama-cpp-server-model-fetch    Helper container for downloading models from URLs
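
The images are published to the GitHub Container Registry. To pull one directly, for example the CUDA image (the examples below use the :main tag):

$ docker pull ghcr.io/allenporter/llama-cpp-server-cuda:main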

Examples

Below are examples using the container images.

Fetching models

Fetch a model:

$ docker run -it \
    -v "./models/:/data/models:rw" \
    -e "MODEL_URLS=https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf" \
    ghcr.io/allenporter/llama-cpp-server-model-fetch:main
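
The fetcher writes each model into the mounted directory, named after the file in the URL, so after the run above you should see something like:

$ ls ./models/
mistral-7b-instruct-v0.1.Q4_K_M.gguf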

Running a server

See the llama-cpp-python server configuration documentation for more details on the config file format.
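
As a minimal sketch, a config.json placed in ./config/ might look like the following. This assumes the multi-model config file format from llama-cpp-python's server settings; the model path matches the volume mount used below, and the mistral-7b-instruct alias is an arbitrary example:

{
  "host": "0.0.0.0",
  "port": 8000,
  "models": [
    {
      "model": "/data/models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
      "model_alias": "mistral-7b-instruct",
      "n_ctx": 4096
    }
  ]
}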

$ docker run -it \
    -v "./models/:/data/models" \
    -v "./config/:/data" \
    -e "CONFIG_FILE=/data/config.json" \
    -p "8000:8000" \
    ghcr.io/allenporter/llama-cpp-server-clblast:main
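
Once running, the server speaks the OpenAI API on port 8000. A quick smoke test, assuming the mistral-7b-instruct alias from the sketch config above:

$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistral-7b-instruct",
      "messages": [{"role": "user", "content": "Say hello."}]
    }'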

Local development

Build the model fetcher:

$ docker build -t model-fetch:dev model-fetch/

Fetch a model:

$ docker run -it \
    -v "./models/:/data/models:rw" \
    -e "MODEL_URLS=https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf" \
    model-fetch:dev