ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: After running for a while, the llama-server exhibits extremely high CPU usage, resulting in timeouts for all requests. #8128

Open moqimoqidea opened 3 days ago

moqimoqidea commented 3 days ago

What happened?

We recently upgraded llama.cpp. After the server has been running for a while, it exhibits extremely high CPU usage and all requests time out. Investigation with tools such as strace and gdb points to significant lock contention.
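For reference, the data below can be gathered with commands roughly like the following; the process lookup and output file names are placeholders, not the exact invocations used:

```sh
# Show per-thread CPU usage for the running llama-server process
top -H -p "$(pgrep -of llama-server)"

# Dump backtraces of all threads to see where they are spinning or blocked
gdb -p "$(pgrep -of llama-server)" -batch -ex "thread apply all bt"

# Summarize syscalls (e.g. futex calls caused by lock contention) for ~30 seconds
timeout 30 strace -f -c -p "$(pgrep -of llama-server)" -o strace-data.txt
```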

top screenshot:

(screenshot attached)

gdb screenshot:

(screenshot attached)

strace files:

strace-data-20240626.txt.zip

Name and Version

$ llama-cli --version
version: 3209 (95f57bb5)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0

What operating system are you seeing the problem on?

Linux

Relevant Data

We run llama.cpp through Tabby, typically building the image with docker build --platform linux/amd64 on an Apple M1 machine. The Dockerfile we use is Tabby v0.12.0's Dockerfile.cuda.
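For completeness, the build invocation is essentially the following; the image tag is illustrative, only the --platform flag and Dockerfile.cuda come from our setup:

```sh
# Cross-build the CUDA image for x86_64 on an Apple Silicon (arm64) host
docker build --platform linux/amd64 -f Dockerfile.cuda -t tabby:0.12.0-cuda .
```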

The server is started with the following command: llama-server -m codellama-13b.Q8_0.gguf --cont-batching --port 30889 -np 3 --log-disable --ctx-size 4096 -ngl 9999. It runs on an Nvidia A100 GPU.

Our pods use the image nvidia/cuda:11.7.1-devel-ubuntu22.04, while the kernel on the node is 3.10.0-1160.31.1.el7.x86_64. We suspect an incompatibility, or some other issue, between the relatively old Linux kernel on the node and the newer llama.cpp build.

As of now, this remains an intermittent issue in our service; it has occurred twice in the last two weeks. Each time, CPU usage in the pod spikes to around 10,000% and gradually returns to normal over the next few hours. The process does not crash; it recovers after an extended period and resumes normal service.
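Until the root cause is found, a watchdog along these lines (purely an illustrative sketch, not part of our deployment) could automatically capture a per-thread snapshot the next time the spike occurs:

```sh
#!/bin/sh
# Illustrative watchdog: when llama-server CPU usage crosses a threshold,
# append a timestamped per-thread snapshot to a log for later analysis.
while sleep 60; do
  pid="$(pgrep -of llama-server)" || continue
  cpu="$(ps -o %cpu= -p "$pid" | awk '{print int($1)}')"
  if [ "${cpu:-0}" -gt 800 ]; then
    { date; top -b -H -n 1 -p "$pid"; } >> cpu-spike.log
  fi
done
```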