LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

--quantkv error with Metal #906

Open Azirine opened 3 weeks ago

Azirine commented 3 weeks ago

I'm getting this error when using --quantkv with Metal.

GGML_ASSERT: ggml-metal.m:924: !"unsupported op"

python3.11 koboldcpp.py Mistral-7B-Instruct-v0.3-Q8_0.gguf --nommap --flashattention --gpulayers 99 --contextsize 32768 --quantkv 1
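For reference, the --quantkv level selects the KV-cache tensor type. Below is a minimal sketch of that mapping, inferred from the two logs in this report (the default run shows f16 K/V, --quantkv 1 shows q8_0 K/V); the q4_0 level is an assumption and is not exercised here.

#include "ggml.h"

// Sketch: plausible mapping from the --quantkv level to llama.cpp KV-cache
// tensor types. Levels 0 and 1 are confirmed by the logs in this report;
// level 2 is an assumed lower-precision option.
static enum ggml_type kv_cache_type_for_quantkv(int quantkv) {
    switch (quantkv) {
        case 1:  return GGML_TYPE_Q8_0; // failing run logs "K (q8_0): 1088.00 MiB"
        case 2:  return GGML_TYPE_Q4_0; // assumption, not confirmed here
        default: return GGML_TYPE_F16;  // working run logs "K (f16): 2064.00 MiB"
    }
}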


***
Welcome to KoboldCpp - Version 1.67
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp_default.so
==========
Namespace(model=None, model_param='Mistral-7B-Instruct-v0.3-Q8_0.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=7, usecublas=None, usevulkan=None, useclblast=None, noblas=False, contextsize=32768, gpulayers=99, tensor_split=None, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=7, lora=None, noshift=False, nommap=True, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, onready='', benchmark=None, multiuser=0, remotetunnel=False, highpriority=False, foreground=False, preloadstory='', quiet=False, ssl=None, nocertify=False, mmproj='', password=None, ignoremissing=False, chatcompletionsadapter='', flashattention=True, quantkv=1, forceversion=0, smartcontext=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=0, sdclamped=False, sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None)
==========
Loading model: Mistral-7B-Instruct-v0.3-Q8_0.gguf

The reported GGUF Arch is: llama


Identified as GGUF model: (ver 6) Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from Mistral-7B-Instruct-v0.3-Q8_0.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 1027
llm_load_vocab: token to piece cache size = 0.1731 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32768
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = unknown, may not work (guessed)
llm_load_print_meta: model params = 7.25 B
llm_load_print_meta: model size = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name = Mistral-7B-Instruct-v0.3
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 781 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.34 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 136.00 MiB
llm_load_tensors: Metal buffer size = 7209.02 MiB
...................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Metal KV buffer size = 2176.00 MiB
llama_new_context_with_model: KV self size = 2176.00 MiB, K (q8_0): 1088.00 MiB, V (q8_0): 1088.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.13 MiB
llama_new_context_with_model: Metal compute buffer size = 128.00 MiB
llama_new_context_with_model: CPU compute buffer size = 72.01 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 2
GGML_ASSERT: ggml-metal.m:924: !"unsupported op"
zsh: abort python3.11 koboldcpp.py Mistral-7B-Instruct-v0.3-Q8_0.ggu

Works fine without --quantkv.
python3.11 koboldcpp.py Mistral-7B-Instruct-v0.3-Q8_0.gguf --nommap --flashattention --gpulayers 99 --contextsize 32768

Welcome to KoboldCpp - Version 1.67
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp_default.so

Namespace(model=None, model_param='Mistral-7B-Instruct-v0.3-Q8_0.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=7, usecublas=None, usevulkan=None, useclblast=None, noblas=False, contextsize=32768, gpulayers=99, tensor_split=None, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=7, lora=None, noshift=False, nommap=True, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, onready='', benchmark=None, multiuser=0, remotetunnel=False, highpriority=False, foreground=False, preloadstory='', quiet=False, ssl=None, nocertify=False, mmproj='', password=None, ignoremissing=False, chatcompletionsadapter='', flashattention=True, quantkv=0, forceversion=0, smartcontext=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=0, sdclamped=False, sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None)

Loading model: Mistral-7B-Instruct-v0.3-Q8_0.gguf

The reported GGUF Arch is: llama


Identified as GGUF model: (ver 6) Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from Mistral-7B-Instruct-v0.3-Q8_0.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 1027
llm_load_vocab: token to piece cache size = 0.1731 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32768
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = unknown, may not work (guessed)
llm_load_print_meta: model params = 7.25 B
llm_load_print_meta: model size = 7.17 GiB (8.50 BPW)
llm_load_print_meta: general.name = Mistral-7B-Instruct-v0.3
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 781 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.34 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 136.00 MiB
llm_load_tensors: Metal buffer size = 7209.02 MiB
...................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 33024
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Metal KV buffer size = 4128.00 MiB
llama_new_context_with_model: KV self size = 4128.00 MiB, K (f16): 2064.00 MiB, V (f16): 2064.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.13 MiB
llama_new_context_with_model: Metal compute buffer size = 128.75 MiB
llama_new_context_with_model: CPU compute buffer size = 72.51 MiB
llama_new_context_with_model: graph nodes = 903
llama_new_context_with_model: graph splits = 2
Load Text Model OK: True
Embedded Kobold Lite loaded.
Embedded API docs loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/

Please connect to custom endpoint at http://localhost:5001
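As a sanity check, the KV-cache sizes in both logs follow directly from the model metadata (n_layer = 32, n_embd_k_gqa = n_embd_v_gqa = 1024) and ggml's storage costs: f16 uses 2 bytes per element, and q8_0 uses 34 bytes per 32-element block (32 int8 values plus an fp16 scale). A small sketch that reproduces the logged figures, using the n_ctx each run actually reports (32768 for the quantized run, 33024 for the default run):

#include <stdio.h>

// Reproduce the per-tensor KV-cache sizes reported in the two logs above.
int main(void) {
    const double n_layer = 32.0, n_embd_gqa = 1024.0;

    // Default run: n_ctx = 33024, f16 K and V (2 bytes per element)
    double f16_mib = 33024.0 * n_layer * n_embd_gqa * 2.0 / (1024.0 * 1024.0);
    printf("f16  K (or V): %.2f MiB  (log: 2064.00 MiB)\n", f16_mib);

    // --quantkv 1 run: n_ctx = 32768, q8_0 K and V (34 bytes per 32 elements)
    double q8_mib = 32768.0 * n_layer * n_embd_gqa * (34.0 / 32.0) / (1024.0 * 1024.0);
    printf("q8_0 K (or V): %.2f MiB  (log: 1088.00 MiB)\n", q8_mib);
    return 0;
}

Doubling each figure for K plus V gives the logged totals of 4128 MiB and 2176 MiB, so --quantkv 1 roughly halves KV-cache memory at this context size.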


System: macOS 14.5, M3 Max
LostRuins commented 3 weeks ago

I believe quantized KV cache support has not yet been implemented for Metal.
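For anyone else hitting this: the abort is the Metal backend refusing a graph node it has no kernel for, rather than falling back to the CPU. The sketch below illustrates the failure mode only; it is not the actual ggml-metal.m code, and the exact type check is an assumption based on the logs (flash attention plus q8_0 K/V aborts, while f16 K/V works).

#include <stdbool.h>
#include "ggml.h"

// Illustration: before executing each node, the Metal backend checks whether
// it can run the op. A FLASH_ATTN_EXT node whose K (src[1]) and V (src[2])
// are quantized fails that check when only f16 KV kernels exist, and the
// backend asserts instead of falling back.
static bool metal_can_run(const struct ggml_tensor * node) {
    if (node->op == GGML_OP_FLASH_ATTN_EXT) {
        return node->src[1]->type == GGML_TYPE_F16 &&
               node->src[2]->type == GGML_TYPE_F16;
    }
    return true; // other ops elided
}

// Paraphrased use during graph compute:
//   if (!metal_can_run(node)) {
//       GGML_ASSERT(!"unsupported op");  // the abort seen at ggml-metal.m:924
//   }

Until Metal gains quantized-KV kernels, dropping --quantkv is the practical workaround, as the second log shows.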