Model in question: https://huggingface.co/bartowski/35b-beta-long-GGUF
The regular K-quants from the same repo (tested with Q4_K_M) offload into the GPU just fine, tested on a 6900 XT using Vulkan. None of the IQ4 or IQ3 quants would load for me with GPU offloading, but they work fine in CPU-only inference (CLBlast).
EDIT - Here's the last message I see on screen before it crashes:
GGML_ASSERT: ggml-vulkan.cpp:2940: !qx_needs_dequant || to_fp16_vk_0 != nullptr