Loading any GGUF with --cache-type-k q8_0 --cache-type-v q8_0 (or any other quantization) makes the server segfault. It should instead fail with an error explaining that KV-cache quantization only works with flash attention (--flash_attn).
Invoking the CLI with --cache-type-k q8_0 --cache-type-v q8_0 --flash_attn appears to work, but the answers are complete rubbish (e.g. an infinite stream of exclamation marks or similar).
This works fine in llama.cpp, which is why I decided to raise the issue.
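What I would expect instead is an upfront parameter check that refuses quantized KV cache types unless flash attention is enabled. A minimal sketch of that check, assuming parameter names along the lines of llama.cpp's gpt_params (cache_type_k, cache_type_v, flash_attn); the actual llamafile struct and option handling may differ:

```cpp
// Minimal sketch of the expected check (not llamafile's actual code).
// Field names follow llama.cpp's gpt_params; a stand-in struct is used
// here so the snippet is self-contained.
#include <cstdio>
#include <string>

struct gpt_params_sketch {            // stand-in for the real gpt_params
    std::string cache_type_k = "f16";
    std::string cache_type_v = "f16";
    bool        flash_attn   = false;
};

static bool validate_kv_cache_params(const gpt_params_sketch & params) {
    const bool kv_quantized =
        params.cache_type_k != "f16" || params.cache_type_v != "f16";
    if (kv_quantized && !params.flash_attn) {
        fprintf(stderr,
                "error: quantized KV cache (--cache-type-k/--cache-type-v) "
                "requires flash attention (--flash_attn)\n");
        return false;
    }
    return true;
}
```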
Version
llamafile v0.8.16 (main branch HEAD at 099534371b38d1bf52047d4d3efd8f2dc56156db)
What operating system are you seeing the problem on?
macOS Sequoia 15.1
Relevant log output
❯ o/llama.cpp/main/main --cache-type-k q8_0 --cache-type-v q8_0 --model /Volumes/ext/store/lm-studio/models/lmstudio-community/Qwen2.5-3B-Instruct-GGUF/Qwen2.5-3B-Instruct-Q4_K_M.gguf
██╗ ██╗ █████╗ ███╗ ███╗ █████╗ ███████╗██╗██╗ ███████╗
██║ ██║ ██╔══██╗████╗ ████║██╔══██╗██╔════╝██║██║ ██╔════╝
██║ ██║ ███████║██╔████╔██║███████║█████╗ ██║██║ █████╗
██║ ██║ ██╔══██║██║╚██╔╝██║██╔══██║██╔══╝ ██║██║ ██╔══╝
███████╗███████╗██║ ██║██║ ╚═╝ ██║██║ ██║██║ ██║███████╗███████╗
╚══════╝╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚═╝╚══════╝╚══════╝
launching server...
error: Uncaught SIGSEGV (SEGV_ACCERR) on MacMini.local pid 16610 tid 262144
/Volumes/ext/code/cpp/llamafile/o/llama.cpp/main/main
Darwin Cosmopolitan 3.9.6 MODE=aarch64; Darwin Kernel Version 24.1.0: Thu Oct 10 21:05:14 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T8103 MacMini.local 24.1.0
cosmoaddr2line /Volumes/ext/code/cpp/llamafile/o/llama.cpp/main/main.aarch64.elf 8001190a8 8000199a4 800013880 80009c3a0 8002dbc0c 8002ff4f8
faulting address is 0000000000000008
0000000000000000 x0 0000000000000000 x8 0000000000000001 x16 0000000100c1e630 x24
0000000800359cb7 x1 0000000000000000 x9 00000001f788d8b0 x17 0000000000000000 x25
0000000800359b6b x2 0000000000000004 x10 0000000000000000 x18 0000000000000000 x26
0000000000000030 x3 0000000000000004 x11 0000000100c1ea00 x19 000000016f5e6f90 x27
0000000100c1d8e8 x4 0124924924924924 x12 00000001009cc6c0 x20 0000000104433b40 x28
0000000100c1d870 x5 000000000000000b x13 0000000100c1dc91 x21 0000000100c1d7e0 x29
0000000100c1f031 x6 0000000000000000 x14 0000000100c1e630 x22 00000008000199a4 x30
7f7f7f7f7f7f7f7f x7 0000000000000000 x15 0000000100c1f3c0 x23 0000000100c1d7e0 x31
0000000100c1d7e0 sp 8001190a8 pc llama_n_ctx+12
0000000100c1d7e0 sp 8000199a4 lr llama_server_context::load_model(gpt_params const&)+380
0000000100c1d910 fp 800013880 lr server_cli(int, char**)+3128
0000000100c1ff00 fp 80009c3a0 lr server_thread(void*)+80
0000000100c1ff60 fp 8002dbc0c lr PosixThread+116
0000000100c1ff70 fp 8002ff4f8 lr __stack_call+24
zsh: segmentation fault o/llama.cpp/main/main --cache-type-k q8_0 --cache-type-v q8_0 --model
❯ o/llama.cpp/main/main --cache-type-k q8_0 --cache-type-v q8_0 --flash_attn --model /Volumes/ext/store/lm-studio/models/lmstudio-community/Qwen2.5-3B-Instruct-GGUF/Qwen2.5-3B-Instruct-Q4_K_M.gguf --chat
██╗ ██╗ █████╗ ███╗ ███╗ █████╗ ███████╗██╗██╗ ███████╗
██║ ██║ ██╔══██╗████╗ ████║██╔══██╗██╔════╝██║██║ ██╔════╝
██║ ██║ ███████║██╔████╔██║███████║█████╗ ██║██║ █████╗
██║ ██║ ██╔══██║██║╚██╔╝██║██╔══██║██╔══╝ ██║██║ ██╔══╝
███████╗███████╗██║ ██║██║ ╚═╝ ██║██║ ██║██║ ██║███████╗███████╗
╚══════╝╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚═╝╚══════╝╚══════╝
software: llamafile 0.8.16
model: Qwen2.5-3B-Instruct-Q4_K_M.gguf
compute: Apple Metal GPU
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
>>> This is a test for quantized KV store
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!^C!
>>> This is a test for quantized KV store
!!!!!!!!!!!!!!!!!!^C!
>>> How are you>
!!!!!?!!!!!!!!!!!!!!!!!!!!!!!!!!!!^C!
>>>
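For what it's worth, the backtrace above (faulting address 0000000000000008, llama_n_ctx+12 called from llama_server_context::load_model) looks like the llama_context pointer comes back null from context creation and is then dereferenced by llama_n_ctx. Upstream llama.cpp refuses a quantized V cache without flash attention and returns a null context, so a null check in load_model would turn the crash into a clean error. A rough sketch, assuming load_model mirrors the upstream llama.cpp server example (llama_init_from_gpt_params, llama_n_ctx); I have not verified the exact llamafile code path:

```cpp
// Rough sketch, inside llama_server_context::load_model (assumed to mirror
// the upstream llama.cpp server example; the exact llamafile code may differ).
std::tie(model, ctx) = llama_init_from_gpt_params(params);
if (model == nullptr || ctx == nullptr) {
    // Context creation can fail (return nullptr), e.g. when a quantized V
    // cache is requested without flash attention; bail out cleanly instead
    // of segfaulting in llama_n_ctx below.
    fprintf(stderr, "error: failed to load model or create context\n");
    return false;
}
n_ctx = llama_n_ctx(ctx);  // safe: ctx is guaranteed non-null here
```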
Contact Details
marcello.seri@gmail.com