Hi! :)

I'm using `llama-cpp-python==0.2.60`, installed with:

```
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```

I'm able to load a model using `type_k=8` and `type_v=8` (for a q8_0 KV cache). However, as soon as I try to generate something with the model, it fails like this:

Basically, I am able to load a model with the 8-bit cache, but I can't actually run inference with it.
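For reference, a minimal sketch of what I'm doing (the model path is a placeholder; any GGUF model reproduces this for me):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",  # placeholder path
    n_gpu_layers=-1,            # offload everything to Metal
    type_k=8,                   # GGML_TYPE_Q8_0 for the K cache
    type_v=8,                   # GGML_TYPE_Q8_0 for the V cache
)

# Loading succeeds; the failure happens on the first generation call:
out = llm("Hello", max_tokens=16)
print(out["choices"][0]["text"])
```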
`uname -a`:

```
Darwin MacBook-Air.local 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:19:22 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T8112 arm64
```