What happened?
In a recent update, we upgraded llama.cpp. After running on the server for a period, it exhibits extremely high CPU usage. Investigation with tools such as strace and gdb indicates significant lock contention.

top screenshot:
gdb screenshot:
strace files:
strace-data-20240626.txt.zip
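For reference, traces like the attached ones can be captured with commands along these lines. This is a sketch, not the exact invocation we used; the PID and output filenames are placeholders.

```shell
# Sketch: capture syscall statistics and per-thread backtraces from a
# running llama-server process. PID and output paths are placeholders.

# Attach to the process and all of its threads (-f), count syscalls and
# time spent in them (-c); the summary is printed on detach (Ctrl-C).
capture_strace() {
    local pid="$1"
    strace -f -c -p "$pid" -o strace-summary.txt
}

# Dump a backtrace of every thread non-interactively, then detach.
capture_gdb_backtraces() {
    local pid="$1"
    gdb -p "$pid" -batch -ex "thread apply all bt" > gdb-backtraces.txt
}

# Usage (hypothetical PID):
#   capture_strace 12345
#   capture_gdb_backtraces 12345
```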
Name and Version
llama-cli --version
version: 3209 (95f57bb5)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0
What operating system are you seeing the problem on?
Linux
Relevant Data
I run llama.cpp via Tabby, typically packaging it with docker build --platform linux/amd64 on an Apple M1 machine. The current Dockerfile can be found at: Tabby v.0.12.0 Dockerfile.cuda.

The command used to start the server is:

llama-server -m codellama-13b.Q8_0.gguf --cont-batching --port 30889 -np 3 --log-disable --ctx-size 4096 -ngl 9999

and the server is configured with an Nvidia A100 GPU.

Our pods run the image nvidia/cuda:11.7.1-devel-ubuntu22.04, while the kernel version on the node is 3.10.0-1160.31.1.el7.x86_64. We suspect an incompatibility, or some other issue, between the relatively old Linux kernel on the node and the newer version of llama.cpp.

As of now this remains an intermittent issue in our service; it has occurred twice in the last two weeks. Each time the problem arises, CPU usage in the pod spikes to around 10,000% and gradually returns to normal over the next few hours. The process does not crash; it recovers after an extended period and resumes normal service.
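Since the spikes are intermittent, a small watchdog along these lines could capture thread backtraces automatically the next time the CPU climbs. This is only a sketch; the threshold, process name, and polling interval are assumptions to adjust for the actual deployment, and ps reports a lifetime-average %cpu, so it reacts to sustained spikes rather than brief ones.

```shell
# Sketch: dump backtraces from llama-server when its CPU usage crosses a
# threshold. Threshold (percent, summed over threads so it can exceed
# 100), process name, and interval are placeholder assumptions.

watch_llama_cpu() {
    local threshold="${1:-1000}"
    while sleep 30; do
        local pid cpu
        pid=$(pgrep -o llama-server) || continue
        # ps %cpu covers all threads of the process but is averaged
        # over the process lifetime, so this is a coarse trigger.
        cpu=$(ps -o %cpu= -p "$pid" | cut -d. -f1 | tr -d ' ')
        if [ "${cpu:-0}" -ge "$threshold" ]; then
            gdb -p "$pid" -batch -ex "thread apply all bt" \
                > "gdb-bt-$(date +%s).txt"
        fi
    done
}
```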