ggerganov / llama.cpp

LLM inference in C/C++
MIT License
65.67k stars 9.42k forks

Quantizing V cache not working yet #4425

Closed CISC closed 4 months ago

CISC commented 9 months ago

Quantizing the K cache (-ctk) works; however, quantizing the V cache (-ctv) does not. I've tried q4_0, q4_1, q8, etc.

Using the cublas-cu12.2.0 release build I get the following error:

llama_kv_cache_init: VRAM kv self = 336.00 MB
llama_new_context_with_model: KV self size = 336.00 MiB, K (f16): 256.00 MiB, V (q4_1): 80.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 291.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 288.00 MiB
llama_new_context_with_model: total VRAM used: 4719.06 MiB (model: 4095.05 MiB, context: 624.00 MiB)

CUDA error 1 at D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: invalid argument
current device: 0
GGML_ASSERT: D:\a\llama.cpp\llama.cpp\ggml-cuda.cu:7596: !"CUDA error"
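
For reference, the same cache types can also be set through the C API. Below is a minimal sketch, assuming the llama.h fields from around the time of the KV cache quantization work (type_k/type_v mirror the -ctk/-ctv flags); the model path is a placeholder, so treat it as an illustration rather than a confirmed repro:

    // Hedged sketch: set the KV cache types via llama_context_params.
    // The model path is a placeholder; type_k/type_v mirror -ctk/-ctv.
    #include "llama.h"
    #include <stdio.h>

    int main(void) {
        llama_backend_init(false); // NUMA off

        struct llama_model_params mparams = llama_model_default_params();
        mparams.n_gpu_layers = 99; // offload everything so the CUDA path is exercised

        struct llama_model * model =
            llama_load_model_from_file("model.gguf", mparams); // placeholder path
        if (model == NULL) { fprintf(stderr, "failed to load model\n"); return 1; }

        struct llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx  = 4096;
        cparams.type_k = GGML_TYPE_F16;  // quantizing K (e.g. GGML_TYPE_Q8_0) works
        cparams.type_v = GGML_TYPE_Q4_1; // quantizing V hits the CUDA error above

        struct llama_context * ctx = llama_new_context_with_model(model, cparams);
        if (ctx == NULL) { fprintf(stderr, "failed to create context\n"); return 1; }

        llama_free(ctx);
        llama_free_model(model);
        llama_backend_free();
        return 0;
    }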

stduhpf commented 9 months ago

It doesn't work on CPU-only or with OpenCL either. I think the quantum V cache is just not implemented yet (see here: https://github.com/ggerganov/llama.cpp/blob/fecac45658a99eddc4d6e36ba0310ca8f87a77f0/ggml.c#L6890).

CISC commented 9 months ago

Yeah, I just noticed: #4309

Ar57m commented 9 months ago

It seems that K cache quantization doesn't work on StableLM models, and maybe on other archs too.

ggerganov commented 9 months ago

The head size has to be a multiple of 32. I think in StableLM it is not

Ar57m commented 9 months ago

> The head size has to be a multiple of 32. I think in StableLM it is not

Zephyr 3b config.json

  "num_attention_heads": 32,
  "num_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32

I tried to run the Q3_K_M quant with -ctk q8_0 and got this:

GGML_ASSERT: llama.cpp:8934: hparams.n_embd_head() % ggml_blck_size(type_k) == 0
Aborted
ggerganov commented 9 months ago

The head size is equal to n_embd / n_head
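
For the numbers behind that: Zephyr 3B's config.json also has hidden_size = 2560 (assumed here; it is not in the excerpt above), so the head size is 2560 / 32 = 80, which is not a multiple of the 32-value block size used by q8_0. A tiny sketch of the check that the assert performs:

    // Sketch of the failing check, assuming Zephyr 3B's hidden_size is 2560
    // (not shown in the config excerpt above) with 32 attention heads.
    // q8_0 stores values in blocks of 32, so the head size must divide evenly.
    #include <stdio.h>

    int main(void) {
        const int n_embd    = 2560; // assumed hidden_size
        const int n_head    = 32;   // num_attention_heads
        const int blck_size = 32;   // block size of q8_0 (and the q4_* types)

        const int n_embd_head = n_embd / n_head; // 80
        printf("head size: %d\n", n_embd_head);
        printf("head size %% block size: %d\n", n_embd_head % blck_size); // 16 -> assert fails
        return 0;
    }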

github-actions[bot] commented 6 months ago

This issue is stale because it has been open for 30 days with no activity.

DesperateZero commented 6 months ago

I know this might be a bit annoying, but I was wondering if there is an estimated timeline for implementing this feature? Given the progress in quantization techniques, large models with low-bit precision are becoming increasingly practical. However, some models (like Qwen1.5-72b, an MHA model) have relatively large memory footprints for their kv-cache. For users like myself who want to work with long contexts, quantization support for the V cache has become the most desired feature.

jukofyork commented 6 months ago

> I know this might be a bit annoying, but I was wondering if there is an estimated timeline for implementing this feature? Given the progress in quantization techniques, large models with low-bit precision are becoming increasingly practical. However, some models (like Qwen1.5-72b, an MHA model) have relatively large memory footprints for their kv-cache. For users like myself who want to work with long contexts, quantization support for the V cache has become the most desired feature.

Yeah, I agree - it seems the trend is for released models to have longer and longer context lengths recently.

CISC commented 6 months ago

@DesperateZero @jukofyork Maybe it would help to tag this issue differently (good-deep-dive-issue :) ) to get someone to pick this up? Or maybe @ggerganov is planning on tackling this at some point himself?

ggerganov commented 6 months ago

The plan is, after merging #5021, to add kernels that work with a quantum KV cache. We are working towards this, but it might take some time to get there.

github-actions[bot] commented 4 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

CISC commented 4 months ago

Perhaps change the label on this issue to bug so it doesn't go stale and auto-close?

vladfaust commented 2 months ago

I'm not sure if it's the right issue, but KV cache quantization is definitely the feature I'm looking forward to, given that my application reuses session dumps a lot; optimizing dump size would be very beneficial.