ggerganov / llama.cpp

LLM inference in C/C++
MIT License
66.41k stars 9.55k forks

Bug: Initializing KV Cache Spikes Memory, Crashing on Android #9671

Closed · amqdn closed this issue 2 weeks ago

amqdn commented 2 weeks ago

What happened?

Hi,

You may already know about the memory spike, given #7474.

For those unfamiliar: ggml_backend_cpu_buffer_clear calls memset, which zero-initializes the entire allocated buffer (as large as 16 GiB for the full context on Llama 3.1), spiking memory and, on Android, crashing the system.

As far as I can tell, there are no guards against llama.cpp over-allocating and over-initializing memory; this may be intended, but it seems to defeat the purpose of mmap.

Please share your perspective on this behavior; my understanding is that it is not clearly intended. With limited experience, I see a number of potential solutions:

  1. Make ggml_backend_buffer_clear truly optional
    • Alternatively, skip it in certain environments
  2. Use ggml_backend_cpu_buffer_memset_tensor in the alloc_tensor_range loop instead, zeroing tensors individually rather than the whole buffer at once (perhaps as part of ggml_tallocr_alloc, or in a separate function)
  3. Require -c in certain environments

To reproduce this behavior, build for Android and run llama-cli, llama-simple, or llama-server with any quantization of Llama 3.1. Without -c, llama.cpp defaults to the context length stored in the model, which in this case is the full training context.
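Conversely, explicitly passing -c caps the KV cache and sidesteps the spike (model path and sizes here are illustrative, not from my setup):

```shell
# Full training context: allocates and zeroes the entire KV cache buffer
./llama-cli -m Llama-3.1-8B-Instruct-Q4_K_M.gguf -p "Hello"

# Capped context: KV cache sized for 4096 tokens only
./llama-cli -m Llama-3.1-8B-Instruct-Q4_K_M.gguf -c 4096 -p "Hello"
```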

I would be happy to implement a fix, whatever is decided. If instead downstream applications should manage this themselves, please clarify.

Thank you.

Name and Version

Termux

$ ./llama-simple --version
version: 3830 (b5de3b74)
built with clang version 18.1.8 for aarch64-unknown-linux-android29

adb shell

$ LD_LIBRARY_PATH=lib ./llama-simple --version                                                  
version: 3830 (b5de3b74)
built with Android (8490178, based on r450784d) clang version 14.0.6 (https://android.googlesource.com/toolchain/llvm-project 4c603efb0cca074e9238af8b4106c30add4418f6) for x86_64-unknown-linux-gnu

What operating system are you seeing the problem on?

Other? (Please let us know in description)

Relevant log output

No response

ggerganov commented 2 weeks ago

The KV cache buffer zeroing is necessary. Without it, certain operations that rely on padded tensor data could access uninitialized memory, which could produce NaNs.

IMO user applications should have some awareness of the context size that they require and allocate only the necessary amount. The llama.cpp examples use the full training context by default, but this is not required and user code can choose any context size that makes sense for the app.
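For downstream code, choosing a context size looks roughly like this (a sketch against the llama.h C API as of this version; the model path and 4096 are illustrative):

```c
#include "llama.h"

// Sketch: cap the KV cache from user code instead of inheriting the
// model's full training context.
struct llama_model_params mparams = llama_model_default_params();
struct llama_model *model = llama_load_model_from_file("model.gguf", mparams);

struct llama_context_params cparams = llama_context_default_params();
cparams.n_ctx = 4096;  // KV cache sized for 4096 tokens, not the full 128K
struct llama_context *ctx = llama_new_context_with_model(model, cparams);
```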

amqdn commented 2 weeks ago

I see. So, then, shall we leave it at: User applications are responsible for selecting an appropriate context size?

ggerganov commented 2 weeks ago

Yes, I think so. At least I can't think of a better solution.

amqdn commented 2 weeks ago

Understood. I will close this issue. Thank you.