LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Tweak missing headcount in GGUF file for GPU autolayers #1191

Closed by Nexesenex 4 weeks ago

Nexesenex commented 1 month ago

If the headcount is not registered in the GGUF metadata, whether because of an incomplete config.json during the initial conversion to GGUF or because the model was converted from GGML to GGUF, then let's assume that headcount = layers, as was the case in the pre-GQA era.

I just ran into this case today after downloading this model: https://huggingface.co/mradermacher/airoboros-65b-gpt4-1.4.1-PI-8192-fp16-i1-GGUF

And I imagine there are several, if not many, similar cases.
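
To make the proposal concrete, here is a rough sketch of the fallback I have in mind. The struct and function names are purely illustrative, not koboldcpp's actual metadata-loading code:

// Illustrative sketch of the proposed fallback for the GPU autolayer estimate.
// If the GGUF metadata carries no KV head count, assume one head per layer,
// as in pre-GQA architectures. All names below are hypothetical.
struct model_meta {
    int n_layer   = 0;  // number of transformer layers
    int n_head_kv = 0;  // 0 means the key was absent from the GGUF metadata
};

static int effective_head_count(const model_meta & meta) {
    if (meta.n_head_kv > 0) {
        return meta.n_head_kv;  // metadata present, trust it
    }
    return meta.n_layer;        // proposed assumption: headcount == layers
}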

LostRuins commented 1 month ago

Not a safe assumption to make.

Right now, if the headcount is not set, it falls back to an older, safer algorithm to prevent OOM. For this change to be used safely, it would need to be tested on a wide range of models and setups.

Nexesenex commented 1 month ago

Correct me if I'm wrong, but the headcount acts as a multiplier of the context buffer size, not a divisor. Are there models with more KV heads than layers? :X If not, then I guess the assumption is safe.
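
To spell that out, here is a back-of-the-envelope sketch of the usual fp16 KV cache sizing. It is illustrative only, not koboldcpp's actual estimator; the point is simply that the KV head count only ever multiplies the buffer size:

#include <cstddef>

// Rough fp16 KV cache size: two tensors (K and V), each holding
// n_ctx * n_layer * n_head_kv * head_dim elements of 2 bytes.
static size_t kv_cache_bytes(size_t n_ctx, size_t n_layer,
                             size_t n_head_kv, size_t head_dim) {
    const size_t bytes_per_elem = 2;  // fp16
    return 2 * n_ctx * n_layer * n_head_kv * head_dim * bytes_per_elem;
}

Under that layout, assuming a head count that is too high only over-reserves memory; the risky direction is when the real head count exceeds the assumed one.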

LostRuins commented 4 weeks ago

TinyLlama 1.1b?

llm_load_print_meta: n_layer          = 22
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4

There are probably others too. My point is that these are separate, unrelated values, and it just seems like a very risky assumption to make. How did you even come to this conclusion to begin with?

Nexesenex commented 4 weeks ago

Well, indeed, TinyLlama is an exception, among others (like Command-R+); I didn't remember that. The heads/layers ratio can reach at least 1.5x. I have modified my own autolayers (AL) accordingly.
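
To put numbers on that ratio: from the log above, TinyLlama has 32 heads against 22 layers, roughly 1.45x; Command-R+ is commonly reported at 96 heads against 64 layers, exactly 1.5x (stated here from memory, so worth double-checking).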