Not a safe assumption to make.
Right now, if the headcount is not set, it falls back to an older, safer algorithm to prevent OOM. For this to be used safely, it would need to be tested on a wide range of models and setups.
Correct me if I'm wrong, but the headcount acts as a multiplier of the context buffer size, not a divisor. Are there models where there are more KV heads than layers? :X If not, then I guess the assumption is safe.
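For context, here is the rough scaling I have in mind, as a minimal sketch with illustrative names rather than the actual llama.cpp code: the headcount only ever grows the buffer, so the risky case would be a model whose real KV head count is larger than the assumed one.

```cpp
// Minimal sketch of the scaling, with illustrative names rather than the
// actual llama.cpp symbols: the KV head count multiplies the per-layer
// K/V row width, so it scales the whole context buffer linearly.
#include <cstdint>
#include <cstdio>

int64_t kv_buffer_bytes(int64_t n_layer, int64_t n_ctx,
                        int64_t n_embd_head, int64_t n_head_kv,
                        int64_t bytes_per_elem) {
    // K and V each store n_ctx rows of (n_embd_head * n_head_kv) values per layer.
    return 2 * n_layer * n_ctx * n_embd_head * n_head_kv * bytes_per_elem;
}

int main() {
    // Llama-7B-style numbers: 32 layers, head dim 128, 32 KV heads, f16 cache.
    const int64_t bytes = kv_buffer_bytes(32, 2048, 128, 32, 2);
    printf("%lld MiB\n", (long long)(bytes >> 20));   // prints 1024 MiB
}
```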
TinyLlama 1.1b?
```
llm_load_print_meta: n_layer    = 22
llm_load_print_meta: n_head     = 32
llm_load_print_meta: n_head_kv  = 4
```
There are probably others too. My point is that these are separate, unrelated values, and it just seems like a very risky assumption to make. How did you even come to this conclusion to begin with?
Well, indeed, TinyLlama is an exception, among others (like Command-R+). I didn't remember that. The ratio can reach at least 1.5x heads/layers. I modified my own AL accordingly.
If the headcount is not registered in the GGUF metadata, due to an incomplete config.json during the initial conversion to GGUF, or due to a conversion of a GGML model into a GGUF model, then let's assume that headcount = layers, as was the case in the pre-GQA era.
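Concretely, something along these lines (a minimal sketch with hypothetical names, not the actual loader code):

```cpp
// Minimal sketch of the proposed fallback; names are hypothetical, not the
// real llama.cpp / GGUF loader symbols.
int effective_kv_head_count(int n_head_kv_metadata, int n_layer) {
    if (n_head_kv_metadata > 0) {
        // The GGUF metadata carries the headcount: use it as-is.
        return n_head_kv_metadata;
    }
    // Key missing (incomplete config.json, or a GGML -> GGUF conversion):
    // assume headcount == layers, as in the pre-GQA era.
    return n_layer;
}
```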
I just encountered this case today when downloading this model: https://huggingface.co/mradermacher/airoboros-65b-gpt4-1.4.1-PI-8192-fp16-i1-GGUF
And I imagine there are several, if not many, similar cases.