jaredquekjz opened 11 months ago
you could try my container: https://hub.docker.com/r/3x3cut0r/llama-cpp-python
I implemented all supported options as environment variables. Tell me what you think, and please tell me about any bugs.
Thanks for your attention. So I tried the Docker image, but the GPU isn't being activated even though the uvicorn server starts. This is my Docker run:
docker run \
--name llama-cpp-python \
--cap-add SYS_RESOURCE \
-v /home/jaredquek/text-generation-webui/models:/models \
-p 8000:8000/tcp \
3x3cut0r/llama-cpp-python:latest \
--model /models/tulu-2-dpo-70b.Q5_K_M.gguf \
--n_gpu_layers 81 \
--chat_format chatml \
--use_mlock False
Does the Docker image run CUDA acceleration by default, or do I have to do something else? Also, would you know which parameter to adjust if I want to handle many concurrent requests through the server? I understand that for the llama.cpp server this is handled by the parallel decoding support added in https://github.com/ggerganov/llama.cpp/pull/3228. Thanks for your advice!
Unfortunately, this Alpine-based image is built with CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS -DLLAMA_AVX=OFF -DLLAMA_AVX2=OFF -DLLAMA_F16C=OFF -DLLAMA_FMA=OFF". This means it is optimized for older CPUs, and GPU support is deactivated.
An image optimized for GPUs with CUDA would need a different base image anyway. Unfortunately, I don't have an Nvidia GPU myself and therefore can't test or deploy anything. But maybe I can have a look at activating GPU support (without CUDA); llama-cpp-python would then need to be recompiled after image creation, or I could publish it under another tag. I need to think about that.
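For anyone who wants to try this themselves, here is a minimal sketch of what such a CUDA-enabled setup could look like (untested and only an assumption: the nvidia/cuda base image tag, the image name my-llama-cpp-cuda, and the model path are placeholders; the cuBLAS flag follows the llama.cpp CMake options of that time):

# rebuild llama-cpp-python with the cuBLAS backend inside a CUDA-capable base image,
# e.g. nvidia/cuda:12.1.1-devel-ubuntu22.04, instead of Alpine
CMAKE_ARGS="-DLLAMA_CUBLAS=ON" pip install --force-reinstall --no-cache-dir 'llama-cpp-python[server]'

# at run time the GPU also has to be passed through to the container
docker run --gpus all -p 8000:8000 -v /models:/models my-llama-cpp-cuda \
  --model /models/model.gguf --n_gpu_layers 81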
I have also not yet dealt with your question about parallel requests, but I would be very interested in that too.
Sorry
Thanks! Perhaps that's true of @abetlen's original image too - not built for CUDA? I'll have to look into it more deeply. I have managed to get the non-Docker version of the server working already. However, of course I prefer the stability of Docker, and I still need to find out about the parallel requests.
@jaredquekjz there are two options really:

- Use environment variables instead of cli args. This is the slightly more idiomatic solution for containers, and every cli argument has a corresponding environment variable, so --n_gpu_layers is equivalent to N_GPU_LAYERS.
- Change your entrypoint to python in the docker command and run with -m llama_cpp.server followed by the cli args. This should allow you to pass either cli or environment variable arguments.

The benefit to using the default entrypoint and environment variables with the official image is that it includes a compiler and will rebuild the image for any cpu architecture you deploy it to, ensuring that it's going to be as fast as or faster than pre-built binaries.
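For the first option, something along these lines ought to work (a sketch only, not verified; the environment variable names are assumed to be the upper-cased forms of the cli arguments, as described above, and the tag and paths are taken from earlier in this thread):

docker run --rm -it -p 8000:8000 \
  -v /home/jaredquek/text-generation-webui/models:/models \
  -e MODEL=/models/tulu-2-dpo-70b.Q5_K_M.gguf \
  -e N_GPU_LAYERS=81 \
  -e CHAT_FORMAT=chatml \
  -e USE_MLOCK=False \
  ghcr.io/abetlen/llama-cpp-python:v0.2.24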
This is pretty cool. Can all the server arguments be set via ENV variables (all capitalized)?
Title: Issue with Passing Custom Arguments to llama_cpp.server in Docker

Issue Description:
Hello @abetlen, I've been trying to use your Docker image ghcr.io/abetlen/llama-cpp-python:v0.2.24 for llama_cpp.server, and I encountered some difficulties when attempting to pass custom arguments (--n_gpu_layers 81, --chat_format chatml, --use_mlock False) to the server through Docker.

Steps to Reproduce:
docker pull ghcr.io/abetlen/llama-cpp-python:v0.2.24
Run the container with the custom arguments listed above. This results in an error: Error: No such option: --n_gpu_layers.

Expected Behavior:
I expected to be able to pass these arguments to the llama_cpp.server application inside the Docker container.

Actual Behavior:
The uvicorn command does not recognize these arguments, as it is designed for the ASGI server, not the llama_cpp.server application.

Potential Solutions:
I would appreciate any assistance or guidance you could provide on this issue.
Thank you for your time and for maintaining this project.
Best regards.
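For reference, a sketch of the second option from the comment above, overriding the entrypoint so the cli args reach llama_cpp.server rather than uvicorn (untested, and it assumes python3 is on the image's PATH; model path and args are the ones used earlier in this thread):

docker run --rm -it -p 8000:8000 \
  -v /home/jaredquek/text-generation-webui/models:/models \
  --entrypoint python3 \
  ghcr.io/abetlen/llama-cpp-python:v0.2.24 \
  -m llama_cpp.server --model /models/tulu-2-dpo-70b.Q5_K_M.gguf \
  --n_gpu_layers 81 --chat_format chatml --use_mlock False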