abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io

KV cache quantization fails with GGML_ASSERT #1335

Open ddh0 opened 7 months ago

ddh0 commented 7 months ago

Hi! :)

I'm using llama-cpp-python==0.2.60, installed with this command: CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python.

I'm able to load a model using type_k=8 and type_v=8 (for q8_0 cache). However, as soon as I try to generate something with the model, it fails like this:

GGML_ASSERT: /private/var/folders/vy/88dklssj0zj64m3xdkv2j0wc0000gn/T/pip-install-ptov4y0g/llama-cpp-python_556e2fb52bea42419cf695448fc31a0c/vendor/llama.cpp/ggml.c:7615: false
GGML_ASSERT: /private/var/folders/vy/88dklssj0zj64m3xdkv2j0wc0000gn/T/pip-install-ptov4y0g/llama-cpp-python_556e2fb52bea42419cf695448fc31a0c/vendor/llama.cpp/ggml.c:7615: false
GGML_ASSERT: /private/var/folders/vy/88dklssj0zj64m3xdkv2j0wc0000gn/T/pip-install-ptov4y0g/llama-cpp-python_556e2fb52bea42419cf695448fc31a0c/vendor/llama.cpp/ggml.c:7615: false
GGML_ASSERT: /private/var/folders/vy/88dklssj0zj64m3xdkv2j0wc0000gn/T/pip-install-ptov4y0g/llama-cpp-python_556e2fb52bea42419cf695448fc31a0c/vendor/llama.cpp/ggml.c:7615: false
zsh: abort      python

Basically, I can load a model with an 8-bit cache, but I can't actually run inference with it.
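
For reference, a minimal sketch of the failing setup (the model path is a placeholder; the value 8 corresponds to GGML_TYPE_Q8_0 in ggml's type enum):

```python
from llama_cpp import Llama

# Load with both the K and V caches quantized to q8_0 (enum value 8).
# Loading succeeds, but the first generation aborts with the
# GGML_ASSERT shown above.
llm = Llama(
    model_path="./model.gguf",  # placeholder path
    type_k=8,  # q8_0 K cache
    type_v=8,  # q8_0 V cache -- this is what triggers the assert
)

out = llm("Hello", max_tokens=8)  # process aborts here
print(out["choices"][0]["text"])
```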

uname -a: Darwin MacBook-Air.local 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:19:22 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8112 arm64

CISC commented 7 months ago

Changing type_v is not yet supported by llama.cpp; see this issue.
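
A sketch of the implied workaround, assuming the same Llama constructor parameters as in the report: quantize only the K cache and leave the V cache at its f16 default.

```python
from llama_cpp import Llama

# Workaround sketch: quantize only the K cache. Leaving type_v unset
# keeps the default f16 V cache and avoids the unsupported path.
llm = Llama(
    model_path="./model.gguf",  # placeholder path
    type_k=8,  # q8_0 K cache
    # type_v intentionally left at its default (f16)
)

print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```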