Looks related to #3384
Unfortunately neither `make clean` nor setting `-t 4` makes a difference. Applying the patch to the latest commit f5ef5cfb18148131fcf45bdd2331f0db5ab7c3d0 doesn't solve the issue either:
EDIT: Reducing the context size to 7000 lowers the required memory to (21037.02 / 21845.34) and allows running the model again.
I suppose that before the #3228 change, this model was just at the limit of what could fit in 32 GB. The changes slightly increased memory usage, so it no longer fits. Your workaround is probably the best option at the moment.
If the issue is the increase in the alloc buffer size, reducing the batch size (`-b`) may also work.
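For anyone driving llama-cpp-python directly, here's a minimal sketch of the same workaround (the model path and values are placeholders; `n_ctx` and `n_batch` map to `-c` and `-b`):

```python
from llama_cpp import Llama

# Smaller context and batch sizes shrink the Metal alloc buffers.
# The values here are illustrative starting points, not tuned recommendations.
llm = Llama(
    model_path="models/codellama-34b.Q4_K_M.gguf",  # placeholder path
    n_ctx=7000,      # context size (the -c flag)
    n_batch=256,     # batch size (the -b flag, default 512)
    n_gpu_layers=1,  # offload to Metal
)

print(llm("Q: What is 2 + 2? A:", max_tokens=8)["choices"][0]["text"])
```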
Hitting this too just now on macOS Ventura 13.5.2 with an M1 Pro:
llama-cpp-python[server]==0.2.11
command: `python -m llama_cpp.server --n_ctx 4096 --model models/llama-2-13b-ensemble-v5.Q4_K_M.gguf` (model link)
llama-index==0.8.38
object: `LlamaCPP(..., model_kwargs={"n_gpu_layers": 1, "n_ctx": 4096}).complete(prompt)`
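For reference, a self-contained sketch of the setup above (assuming llama-index 0.8.x exposes `LlamaCPP` under `llama_index.llms`; the prompt is a placeholder):

```python
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="models/llama-2-13b-ensemble-v5.Q4_K_M.gguf",
    model_kwargs={"n_gpu_layers": 1, "n_ctx": 4096},  # same kwargs as above
)

print(llm.complete("Hello").text)  # placeholder prompt
```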
The client side gets this error:
ggml_metal_graph_compute: command buffer 0 failed with status 5
GGML_ASSERT: /private/var/folders/78/lm6p91s90fx99cshsxqz_19w0000gn/T/pip-install-vzfiwviq/llama-cpp-python_703f6576256241f7894dbfd75e7b496f/vendor/llama.cpp/ggml-metal.m:1369: false
My context size is 4096. Any pointers on how to get around this?
The `GGML_ASSERT` was not triggered for me after moving to a Q4_0 model. One suggestion is to add a human-friendly decoding of status 5 to the message `command buffer 0 failed with status 5`, so that it's easier to pick a corrective action. From an outside-the-code perspective, it seems status 5 is related to an unsupported quantization type.
> it seems status 5 is related to an unsupported quantization type
No - the device runs out of memory in this case. But status 5 can mean a variety of things.
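For anyone decoding these logs: the status values come from Metal's `MTLCommandBufferStatus` enum, so a human-friendly mapping would be roughly this (a lookup sketch for reading the logs, not llama.cpp code):

```python
# MTLCommandBufferStatus values, as printed in "command buffer 0 failed with status N".
MTL_COMMAND_BUFFER_STATUS = {
    0: "NotEnqueued",
    1: "Enqueued",
    2: "Committed",
    3: "Scheduled",
    4: "Completed",
    5: "Error",  # the concrete cause (e.g. out of memory) lives in the command buffer's error property
}

print(MTL_COMMAND_BUFFER_STATUS[5])  # -> Error
```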
A workaround for `Q4_K_M` is to either reduce the context size or the batch size.
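For the llama-cpp-python server invocation above, that would look something like this (flag names assuming llama-cpp-python 0.2.x; the values are illustrative):

command: `python -m llama_cpp.server --model models/llama-2-13b-ensemble-v5.Q4_K_M.gguf --n_ctx 2048 --n_batch 256`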
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Before ec893798b7a2a803466cc8f063051499ec3d96f7, llama.cpp was able to load and run CodeLlama 34B Q4_K_M via Metal on a 32 GB Apple M1 Max.
Output for 45855b3f1c7bdd0320aa632334d0b3e8965c26c4:
Current Behavior
Output since ec893798b7a2a803466cc8f063051499ec3d96f7: