Closed: CISC closed this issue 5 months ago
It doesn't work CPU-only, or with OpenCL either. I think the quantum V cache is just not implemented yet (see here: https://github.com/ggerganov/llama.cpp/blob/fecac45658a99eddc4d6e36ba0310ca8f87a77f0/ggml.c#L6890).
Yeah, I just noticed: #4309
Seems that K cache quantization doesn't work on StableLM models, and maybe on other archs too.
The head size has to be a multiple of 32. I think in StableLM it is not
Zephyr 3b config.json
"num_attention_heads": 32,
"num_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32
I tried to run the 3km quant with -ctk q8_0 and I got this:
GGML_ASSERT: llama.cpp:8934: hparams.n_embd_head() % ggml_blck_size(type_k) == 0
Aborted
The head size is equal to n_embd / n_head
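For a concrete check, here is a minimal sketch of the arithmetic behind that assert. The hidden size of 2560 is an assumption taken from Zephyr 3b's config.json (it is not in the snippet quoted above), and 32 is the q8_0 block size:

```c
#include <stdio.h>

/* Sketch of the condition checked by
 * GGML_ASSERT: hparams.n_embd_head() % ggml_blck_size(type_k) == 0
 * The model numbers are assumptions for StableLM Zephyr 3b. */
int main(void) {
    const int n_embd = 2560;  /* assumed hidden_size of Zephyr 3b            */
    const int n_head = 32;    /* num_attention_heads from the config above   */
    const int blck   = 32;    /* block size of q8_0, i.e. ggml_blck_size()   */

    const int n_embd_head = n_embd / n_head;  /* head size = 2560 / 32 = 80 */

    printf("head size = %d, q8_0 block size = %d, remainder = %d\n",
           n_embd_head, blck, n_embd_head % blck);

    /* 80 % 32 == 16 != 0, so the assert fires and -ctk q8_0 aborts here. */
    return (n_embd_head % blck == 0) ? 0 : 1;
}
```

So the head size comes out to 80, which is not a multiple of 32, and that is exactly the condition the assert rejects.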
This issue is stale because it has been open for 30 days with no activity.
I know this might be a bit annoying, but I was wondering if there is an estimated timeline for implementing this feature? Given the progress in quantization techniques, large models with low-bit precision are becoming increasingly practical. However, some models (like Qwen1.5-72b, an MHA model) have relatively large memory footprints for their KV cache. For users like myself who want to work with long contexts, quantization support for the V cache has become the most desired feature.
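To put a rough number on that footprint (a back-of-the-envelope sketch; the 80 layers, 8192 hidden size, and plain MHA for Qwen1.5-72b are assumptions from its published config, and the 32k context is only an example):

```c
#include <stdio.h>

/* Rough estimate of the f16 KV cache size for an MHA model.
 * Qwen1.5-72b numbers below are assumptions used only to illustrate scale. */
int main(void) {
    const long long n_layer   = 80;     /* assumed num_hidden_layers           */
    const long long n_embd    = 8192;   /* with MHA, K and V each store n_embd */
    const long long n_ctx     = 32768;  /* example context length              */
    const long long bytes_f16 = 2;

    const long long kv_bytes = 2 /* K and V */ * n_layer * n_embd * bytes_f16 * n_ctx;

    printf("f16 KV cache at %lld tokens: %.1f GiB\n",
           n_ctx, kv_bytes / (1024.0 * 1024.0 * 1024.0));
    /* ~80 GiB at f16; q8_0 would roughly halve it, q4_0 roughly quarter it. */
    return 0;
}
```

That works out to around 80 GiB at f16 for a 32k context, which is why V cache quantization matters so much for long contexts with MHA models.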
Yeah, I agree - it seems the trend is for released models to have longer and longer context lengths recently.
@DesperateZero @jukofyork Maybe it would help to tag this issue differently (good-deep-dive-issue :) ) to get someone to pick this up? Or maybe @ggerganov is planning on tackling this at some point himself?
The plan is, after merging #5021, to add kernels that work with a quantum KV cache. We are working towards this, but it might take some time to get there.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Perhaps change the label on this issue to bug so it doesn't go stale and auto-close?
I'm not sure if it's the right issue, but KV cache quantization is definitely the feature I'm looking forward to, given that my application reuses session dumps a lot; optimizing dump size would be very beneficial.
Quantizing the K cache (-ctk) works; however, quantizing the V cache (-ctv) does not. I've tried q4_0, q4_1, q8, etc.
Using the cublas-cu12.2.0 release build I get the following error:
llama_kv_cache_init: VRAM kv self = 336.00 MB
llama_new_context_with_model: KV self size = 336.00 MiB, K (f16): 256.00 MiB, V (q4_1): 80.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 291.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 288.00 MiB
llama_new_context_with_model: total VRAM used: 4719.06 MiB (model: 4095.05 MiB, context: 624.00 MiB)
CUDA error 1 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: invalid argument
current device: 0
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: !"CUDA error"