Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Bug: segfault loading models with KV quantization and related problems #610

Open mseri opened 2 weeks ago

mseri commented 2 weeks ago

Contact Details

marcello.seri@gmail.com

What happened?

Loading any GGUF with --cache-type-k q8_0 --cache-type-v q8_0 (or any other quantization) makes the server segfault. Instead of crashing, this should fail with an error explaining that KV quantization only works with flash attention (--flash_attn).
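The flag combination could be rejected during startup instead. Here is a minimal sketch of such a guard, assuming a gpt_params-style options struct with string-valued cache types and a flash_attn flag (names borrowed from llama.cpp; llamafile's internal fields may differ — upstream llama.cpp performs a similar check and refuses a quantized V cache without flash attention):

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical startup guard: refuse quantized KV cache types unless
// flash attention is enabled, instead of segfaulting later in
// llama_server_context::load_model.
static void validate_kv_cache_flags(const std::string &cache_type_k,
                                    const std::string &cache_type_v,
                                    bool flash_attn) {
    auto is_quantized = [](const std::string &t) {
        return t != "f16" && t != "f32";  // q8_0, q4_0, ... are quantized
    };
    if ((is_quantized(cache_type_k) || is_quantized(cache_type_v)) &&
        !flash_attn) {
        fprintf(stderr, "error: --cache-type-k/--cache-type-v quantization "
                        "requires flash attention (--flash_attn)\n");
        exit(1);
    }
}
```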

Invoking the CLI with --cache-type-k q8_0 --cache-type-v q8_0 --flash_attn, everything seems to load properly; however, the answers are complete rubbish (e.g. an infinite stream of exclamation marks or similar output).

This works fine in llama.cpp, which is why I decided to raise the issue.

Version

llamafile v0.8.16 (main branch HEAD at 099534371b38d1bf52047d4d3efd8f2dc56156db)

What operating system are you seeing the problem on?

macOS Sequoia 15.1

Relevant log output

❯ o/llama.cpp/main/main --cache-type-k q8_0 --cache-type-v q8_0 --model  /Volumes/ext/store/lm-studio/models/lmstudio-community/Qwen2.5-3B-Instruct-GGUF/Qwen2.5-3B-Instruct-Q4_K_M.gguf

██╗     ██╗      █████╗ ███╗   ███╗ █████╗ ███████╗██╗██╗     ███████╗
██║     ██║     ██╔══██╗████╗ ████║██╔══██╗██╔════╝██║██║     ██╔════╝
██║     ██║     ███████║██╔████╔██║███████║█████╗  ██║██║     █████╗
██║     ██║     ██╔══██║██║╚██╔╝██║██╔══██║██╔══╝  ██║██║     ██╔══╝
███████╗███████╗██║  ██║██║ ╚═╝ ██║██║  ██║██║     ██║███████╗███████╗
╚══════╝╚══════╝╚═╝  ╚═╝╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚══════╝╚══════╝
 launching server...
error: Uncaught SIGSEGV (SEGV_ACCERR) on MacMini.local pid 16610 tid 262144
 /Volumes/ext/code/cpp/llamafile/o/llama.cpp/main/main
 Darwin Cosmopolitan 3.9.6 MODE=aarch64; Darwin Kernel Version 24.1.0: Thu Oct 10 21:05:14 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T8103 MacMini.local 24.1.0
 cosmoaddr2line /Volumes/ext/code/cpp/llamafile/o/llama.cpp/main/main.aarch64.elf 8001190a8 8000199a4 800013880 80009c3a0 8002dbc0c 8002ff4f8
 faulting address is 0000000000000008
 0000000000000000 x0 0000000000000000 x8  0000000000000001 x16 0000000100c1e630 x24
 0000000800359cb7 x1 0000000000000000 x9  00000001f788d8b0 x17 0000000000000000 x25
 0000000800359b6b x2 0000000000000004 x10 0000000000000000 x18 0000000000000000 x26
 0000000000000030 x3 0000000000000004 x11 0000000100c1ea00 x19 000000016f5e6f90 x27
 0000000100c1d8e8 x4 0124924924924924 x12 00000001009cc6c0 x20 0000000104433b40 x28
 0000000100c1d870 x5 000000000000000b x13 0000000100c1dc91 x21 0000000100c1d7e0 x29
 0000000100c1f031 x6 0000000000000000 x14 0000000100c1e630 x22 00000008000199a4 x30
 7f7f7f7f7f7f7f7f x7 0000000000000000 x15 0000000100c1f3c0 x23 0000000100c1d7e0 x31
 0000000100c1d7e0 sp 8001190a8 pc llama_n_ctx+12
 0000000100c1d7e0 sp 8000199a4 lr llama_server_context::load_model(gpt_params const&)+380
 0000000100c1d910 fp 800013880 lr server_cli(int, char**)+3128
 0000000100c1ff00 fp 80009c3a0 lr server_thread(void*)+80
 0000000100c1ff60 fp 8002dbc0c lr PosixThread+116
 0000000100c1ff70 fp 8002ff4f8 lr __stack_call+24
zsh: segmentation fault  o/llama.cpp/main/main --cache-type-k q8_0 --cache-type-v q8_0 --model
❯ o/llama.cpp/main/main --cache-type-k q8_0 --cache-type-v q8_0 --flash_attn --model  /Volumes/ext/store/lm-studio/models/lmstudio-community/Qwen2.5-3B-Instruct-GGUF/Qwen2.5-3B-Instruct-Q4_K_M.gguf --chat

██╗     ██╗      █████╗ ███╗   ███╗ █████╗ ███████╗██╗██╗     ███████╗
██║     ██║     ██╔══██╗████╗ ████║██╔══██╗██╔════╝██║██║     ██╔════╝
██║     ██║     ███████║██╔████╔██║███████║█████╗  ██║██║     █████╗
██║     ██║     ██╔══██║██║╚██╔╝██║██╔══██║██╔══╝  ██║██║     ██╔══╝
███████╗███████╗██║  ██║██║ ╚═╝ ██║██║  ██║██║     ██║███████╗███████╗
╚══════╝╚══════╝╚═╝  ╚═╝╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚══════╝╚══════╝
software: llamafile 0.8.16
model:    Qwen2.5-3B-Instruct-Q4_K_M.gguf
compute:  Apple Metal GPU

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
>>> This is a test for quantized KV store
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!^C!
>>> This is a test for quantized KV store
!!!!!!!!!!!!!!!!!!^C!
>>> How are you>
!!!!!?!!!!!!!!!!!!!!!!!!!!!!!!!!!!^C!
>>>
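For what it's worth, the faulting address 0000000000000008 together with the top frame llama_n_ctx+12 looks like a null llama_context pointer being dereferenced at a small struct offset: if llama_new_context_with_model rejects the quantized cache types (as upstream llama.cpp does without flash attention) and returns nullptr, and llama_server_context::load_model then passes that pointer straight to llama_n_ctx, this is exactly the crash that would result. A sketch of the defensive check, reconstructed from the symbol names in the backtrace rather than from llamafile's actual source:

```cpp
// Reconstructed from the backtrace, not actual llamafile code:
// llama_new_context_with_model() returns nullptr when it rejects the
// requested KV cache configuration, so the result must be checked
// before calling llama_n_ctx() on it.
llama_context *ctx = llama_new_context_with_model(model, ctx_params);
if (ctx == nullptr) {
    fprintf(stderr, "error: failed to create llama context "
                    "(quantized KV cache without --flash_attn?)\n");
    return false;  // propagate the failure instead of crashing
}
const int n_ctx = llama_n_ctx(ctx);  // safe only after the null check
```

Failing fast at either of these two points would turn the segfault into a clean error; the rubbish output when --flash_attn is enabled is a separate symptom that the trace above does not cover.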