abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Include in Readme how to Pass Custom Arguments to `llama_cpp.server` in Docker #1029

Open jaredquekjz opened 11 months ago

jaredquekjz commented 11 months ago

Title:

Issue with Passing Custom Arguments to llama_cpp.server in Docker

Issue Description:

Hello abetlen,

I've been trying to use your Docker image ghcr.io/abetlen/llama-cpp-python:v0.2.24 for llama_cpp.server, and I encountered some difficulties when attempting to pass custom arguments (--n_gpu_layers 81, --chat_format chatml, --use_mlock False) to the server through Docker.

Steps to Reproduce:

  1. Pull the Docker image: docker pull ghcr.io/abetlen/llama-cpp-python:v0.2.24
  2. Run the container with custom arguments:

    docker run --rm -it -p 8000:8000 \
      -v /home/jaredquek/text-generation-webui/models:/models \
      -e MODEL=/models/tulu-2-dpo-70b.Q5_K_M.gguf \
      --entrypoint uvicorn \
      ghcr.io/abetlen/llama-cpp-python:v0.2.24 \
      --factory llama_cpp.server.app:create_app --host 0.0.0.0 --port 8000 \
      --n_gpu_layers 81 --chat_format chatml --use_mlock False

    This results in an error: Error: No such option: --n_gpu_layers.

Expected Behavior:

I expected to be able to pass these arguments to the llama_cpp.server application inside the Docker container.

Actual Behavior:

The uvicorn command does not recognize these arguments because they belong to the llama_cpp.server application, not to the ASGI server itself.

Potential Solutions:

I would appreciate any assistance or guidance you could provide on this issue.

Thank you for your time and for maintaining this project.

Best regards.

3x3cut0r commented 11 months ago

You could try my container: https://hub.docker.com/r/3x3cut0r/llama-cpp-python

I exposed every supported option as an environment variable. Tell me what you think, and please report any bugs.
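Usage is roughly like this (an illustrative sketch reusing the model path from the original post; the exact variable names and defaults are listed in the README on Docker Hub):

    docker run --rm -it -p 8000:8000 \
      -v /home/jaredquek/text-generation-webui/models:/models \
      -e MODEL=/models/tulu-2-dpo-70b.Q5_K_M.gguf \
      -e N_GPU_LAYERS=81 \
      -e CHAT_FORMAT=chatml \
      3x3cut0r/llama-cpp-python:latest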

jaredquekjz commented 11 months ago

Thanks for your attention. I tried the Docker image, but the GPU isn't being activated even though the uvicorn server starts. This is my Docker run:

    docker run \
    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -v /home/jaredquek/text-generation-webui/models:/models \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest \
    --model /models/tulu-2-dpo-70b.Q5_K_M.gguf \
    --n_gpu_layers 81 \
    --chat_format chatml \
    --use_mlock False

Does the Docker image run CUDA acceleration by default, or do I have to do something else? Also, would you know which parameter to adjust if I want to handle many concurrent requests through the server? I understand that for the llama.cpp server it's done by ngl: https://github.com/ggerganov/llama.cpp/pull/3228. Thanks for your advice!

3x3cut0r commented 11 months ago

Unfortunately, this Alpine-based image is built with CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS -DLLAMA_AVX=OFF -DLLAMA_AVX2=OFF -DLLAMA_F16C=OFF -DLLAMA_FMA=OFF". This means it is optimized for older CPUs, and GPU support is deactivated.

An image optimized for GPUs with CUDA would need a different base image anyway. Unfortunately, I don't have an Nvidia GPU myself and therefore can't test or deploy anything. But maybe I can have a look at activating GPU support (without CUDA); llama-cpp-python would then need to be recompiled after image creation. Or I could publish it under another tag. I need to think about that.
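
For anyone who does want CUDA, a rough, untested sketch of such an image might look something like this - the nvidia/cuda base tag, the LLAMA_CUBLAS CMake flag and the FORCE_CMAKE variable are assumptions about how llama-cpp-python v0.2.x is typically built with GPU support, and I can't verify any of it without a GPU:

    # Untested sketch of a CUDA-enabled image (not an official build).
    FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
    RUN apt-get update && apt-get install -y python3 python3-pip build-essential && \
        rm -rf /var/lib/apt/lists/*
    # LLAMA_CUBLAS enables the cuBLAS backend; FORCE_CMAKE forces a source build
    # instead of a prebuilt CPU wheel.
    ENV CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1
    RUN python3 -m pip install "llama-cpp-python[server]==0.2.24"
    # Settings come from env variables (MODEL, N_GPU_LAYERS, ...) or CLI flags.
    ENTRYPOINT ["python3", "-m", "llama_cpp.server", "--host", "0.0.0.0", "--port", "8000"]
    # Run it with: docker run --gpus all -p 8000:8000 -v <host models dir>:/models -e MODEL=/models/... <image>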

I have also not yet dealt with your question about parallel requests, but I would be very interested in that too.

Sorry

jaredquekjz commented 11 months ago

Thanks! Perhaps that's true of abetlen's original image too - not built for CUDA? I'll have to look into it more deeply. I have managed to get the non-Docker version of the server working already. But of course I'd prefer the stability of Docker, and I still need to find out about the parallel requests.

abetlen commented 10 months ago

@jaredquekjz there are two options really

  1. Use environment variables instead of CLI args. This is the slightly more idiomatic solution for containers, and every CLI argument has a corresponding environment variable, so --n_gpu_layers is equivalent to N_GPU_LAYERS.
  2. Change your entrypoint to python in the docker command and run with -m llama_cpp.server followed by the CLI args; this should allow you to pass either CLI or environment variable arguments.

The benefit of using the default entrypoint and environment variables with the official image is that it includes a compiler and will rebuild the image for any CPU architecture you deploy it to, ensuring that it's going to be as fast as or faster than pre-built binaries.
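
Concretely, the two options might look something like this with the image and model path from the original post (an untested sketch; the interpreter name inside the image and the exact boolean formatting for USE_MLOCK are assumptions):

    # Option 1: default entrypoint, settings passed as upper-cased env variables.
    docker run --rm -it -p 8000:8000 \
      -v /home/jaredquek/text-generation-webui/models:/models \
      -e MODEL=/models/tulu-2-dpo-70b.Q5_K_M.gguf \
      -e N_GPU_LAYERS=81 \
      -e CHAT_FORMAT=chatml \
      -e USE_MLOCK=False \
      ghcr.io/abetlen/llama-cpp-python:v0.2.24

    # Option 2: python entrypoint, CLI args after -m llama_cpp.server.
    # Swap python3 for python if that is the interpreter name in the image.
    docker run --rm -it -p 8000:8000 \
      -v /home/jaredquek/text-generation-webui/models:/models \
      --entrypoint python3 \
      ghcr.io/abetlen/llama-cpp-python:v0.2.24 \
      -m llama_cpp.server --host 0.0.0.0 --port 8000 \
      --model /models/tulu-2-dpo-70b.Q5_K_M.gguf \
      --n_gpu_layers 81 --chat_format chatml --use_mlock False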

maziyarpanahi commented 8 months ago

> @jaredquekjz there are two options really
>
>   1. Use environment variables instead of CLI args. This is the slightly more idiomatic solution for containers, and every CLI argument has a corresponding environment variable, so --n_gpu_layers is equivalent to N_GPU_LAYERS.
>   2. Change your entrypoint to python in the docker command and run with -m llama_cpp.server followed by the CLI args; this should allow you to pass either CLI or environment variable arguments.
>
> The benefit of using the default entrypoint and environment variables with the official image is that it includes a compiler and will rebuild the image for any CPU architecture you deploy it to, ensuring that it's going to be as fast as or faster than pre-built binaries.

This is pretty cool - can all the server arguments be set via ENV variables (all capitalized)?