LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Qwen 1.5 models - question about huge KV cache #724

Closed: Tacx79 closed this issue 5 months ago

Tacx79 commented 5 months ago

UPDATE: Silly me, the Q4_K_M version barely fits in 80 GB, but only when loading the model with 8k context, which is a little surprising because I don't have issues running 120B Q3-Q4 models with 32k context. Is the huge K/V cache normal in this particular case?

Original post: (I'm not sure if this is a bug or if newer Qwen models are not supported. I can see some kind of support was added in #549, but it seems that was some kind of converted model.)

I tried two GGUFs to make sure it's not a problem with a corrupted file or with me joining the parts wrong; I didn't check smaller Qwen models.

Tested koboldcpp versions: 1.57.1 | 1.59.1 | 1.60 (CUDA + Windows)
Repo: https://huggingface.co/Qwen/Qwen1.5-72B-Chat-GGUF
Models: Q2_K and Q4_K_M

Q2 immediately allocates 80 GB of RAM and crashes; Q4 loads at the usual speed up to ~50-60 GB of RAM usage and then errors out while loading the model. Q2 on 1.57.1 first loads the model up to 40-50 GB, then allocates an additional 80 GB, prints a warning, and crashes after some time with this message:

WARNING: failed to allocate 82120.00 MB of pinned memory: out of memory
llama_kv_cache_init:        CPU KV buffer size = 82120.00 MiB
llama_new_context_with_model: KV self size  = 82120.00 MiB, K (f16): 41060.00 MiB, V (f16): 41060.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    80.29 MiB
WARNING: failed to allocate 4569.40 MB of pinned memory: out of memory
GGML_ASSERT: D:\a\koboldcpp\koboldcpp\ggml-backend.c:555: data != NULL && "failed to allocate buffer"
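
For reference, the 82120 MiB figure matches what the standard llama.cpp KV sizing works out to for this model's dimensions (printed further down in the log: n_layer = 80, n_embd_k_gqa = 8192, n_ctx = 32848, f16 K and V). Qwen1.5-72B reports n_head_kv = 64, i.e. no GQA, so its KV cache per token is roughly 8x that of a GQA 70B such as Llama 2 (n_head_kv = 8), which is presumably why 120B merges with GQA fit at 32k while this one does not. A minimal back-of-the-envelope sketch (not koboldcpp's actual code):

# KV-cache sizing sketch: f16 elements (2 bytes), K and V stored separately per layer.
# Dimensions are taken from the llm_load_print_meta dump below.
def kv_cache_mib(n_layer, n_embd_kv, n_ctx, bytes_per_elem=2):
    per_tensor = n_layer * n_ctx * n_embd_kv * bytes_per_elem  # K (or V) alone
    return 2 * per_tensor / (1024 ** 2)                        # K + V, in MiB

print(kv_cache_mib(80, 8192, 32848))  # 82120.0 -> matches the warning above
print(kv_cache_mib(80, 8192, 8192))   # 20480.0 -> roughly why 8k context fits in 80 GB
print(kv_cache_mib(80, 1024, 32848))  # 10265.0 -> same context on a GQA 70B (n_embd_k_gqa = 1024)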

Q4 files were joined with:

copy /B qwen1_5-72b-chat-q4_k_m.gguf.a + qwen1_5-72b-chat-q4_k_m.gguf.b qwen1_5-72b-chat-q4_k_m.gguf
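
A quick way to rule out a bad join (a hypothetical sanity check, file names as in the command above): the joined file's size should equal the sum of the two parts, and it should start with the GGUF magic bytes.

import os

parts = ["qwen1_5-72b-chat-q4_k_m.gguf.a", "qwen1_5-72b-chat-q4_k_m.gguf.b"]
joined = "qwen1_5-72b-chat-q4_k_m.gguf"

assert os.path.getsize(joined) == sum(os.path.getsize(p) for p in parts)
with open(joined, "rb") as f:
    assert f.read(4) == b"GGUF"   # every GGUF file begins with this magic
print("join looks OK")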

Koboldcpp config file:

{"model": null, "model_param": "F:/_MODELS/[GGUF]_qwen_1.5_72b_chat_q4_K_M/qwen1_5-72b-chat-q4_k_m.gguf", "port": 5001, "port_param": 5000, "host": "", "launch": false, "lora": null, "config": null, "threads": 8, "blasthreads": 8, "highpriority": false, "contextsize": 32768, "blasbatchsize": 512, "ropeconfig": [0.0, 10000.0], "smartcontext": true, "noshift": false, "bantokens": null, "forceversion": 0, "nommap": false, "usemlock": true, "noavx2": false, "debugmode": 0, "skiplauncher": false, "hordeconfig": null, "noblas": false, "useclblast": null, "usecublas": ["normal", "mmq"], "usevulkan": null, "gpulayers": 0, "tensor_split": null, "onready": "", "benchmark": null, "multiuser": 0, "remotetunnel": false, "foreground": false, "preloadstory": null, "quiet": false, "ssl": null, "nocertify": false}

Q2_K 1.60:

llama_model_loader: loaded meta data with 21 key-value pairs and 963 tensors from F:\_MODELS\[GGUF]_qwen_1.5_72b_chat_q2_K\qwen1...
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 64
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 8192
llm_load_print_meta: n_embd_v_gqa     = 8192
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 24576
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = unknown, may not work (guessed)
llm_load_print_meta: model params     = 72.29 B
llm_load_print_meta: model size       = 26.50 GiB (3.15 BPW)
llm_load_print_meta: general.name     = Qwen1.5-72B-Chat-AWQ-fp16
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 30 '?'
llm_load_tensors: ggml ctx size =    0.43 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/81 layers to GPU
llm_load_tensors:        CPU buffer size = 27136.88 MiB
.................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 32848
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
WARNING: failed to allocate 82120.00 MB of pinned memory: out of memory

Q4_K_M 1.60:

llama_model_loader: loaded meta data with 21 key-value pairs and 963 tensors from F:\_MODELS\[GGUF]_qwen_1.5_72b_chat_q4_K_M\qwen1_5-72b-chat-q4_k_m.gguf
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 64
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 8192
llm_load_print_meta: n_embd_v_gqa     = 8192
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 24576
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = unknown, may not work (guessed)
llm_load_print_meta: model params     = 72.29 B
llm_load_print_meta: model size       = 41.07 GiB (4.88 BPW)
llm_load_print_meta: general.name     = Qwen1.5-72B-Chat-AWQ-fp16
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 30 '?'
llm_load_tensors: ggml ctx size =    0.43 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/81 layers to GPU
llm_load_tensors:        CPU buffer size = 42055.31 MiB
...................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 32848
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
WARNING: failed to allocate 82120.00 MB of pinned memory: out of memory
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 86109061152
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
gpttype_load_model: error: failed to load model 'F:\_MODELS\[GGUF]_qwen_1.5_72b_chat_q4_K_M\qwen1_5-72b-chat-q4_k_m.gguf'
Load Text Model OK: False
Could not load text model: F:\_MODELS\[GGUF]_qwen_1.5_72b_chat_q4_K_M\qwen1_5-72b-chat-q4_k_m.gguf
LostRuins commented 5 months ago

Yeah, that's just OOM. Maybe just swap to a slightly smaller quant.