LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with a KoboldAI UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

GGML_ASSERT when trying to load an IQ1_S model #776

Closed Kas1o closed 1 month ago

Kas1o commented 3 months ago
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =  1786.76 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  1786.77 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    88.77 MiB
llama_new_context_with_model: graph nodes  = 2312
llama_new_context_with_model: graph splits = 3
GGML_ASSERT: d:\a\koboldcpp\koboldcpp\ggml-cuda\dmmv.cu:804: false

The model I use: https://huggingface.co/dranger003/c4ai-command-r-plus-iMat.GGUF/blob/main/ggml-c4ai-command-r-plus-104b-iq1_s.gguf

Checking dmmv.cu at line 804, I noticed it doesn't contain a case for IQ1_S. Is it not supported?
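
For reference, the failing dispatch looks roughly like this (a simplified sketch, not the exact koboldcpp source; the GGML_TYPE_* names are real ggml identifiers, everything else here is paraphrased for illustration):

#include <cstdio>
#include <cstdlib>

// minimal stand-in for ggml's assert macro
#define GGML_ASSERT(x) \
    do { if (!(x)) { \
        fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
        abort(); \
    } } while (0)

// tiny subset of ggml's quant-type enum, just for illustration
enum ggml_type { GGML_TYPE_Q4_0, GGML_TYPE_Q8_0, GGML_TYPE_IQ1_S };

// the dmmv path switches on the quant type; any type without a
// dedicated dequantize+mat-vec kernel falls into the default branch
static void dequantize_mul_mat_vec(ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0: /* launch q4_0 dmmv kernel */ break;
        case GGML_TYPE_Q8_0: /* launch q8_0 dmmv kernel */ break;
        // no case for the IQ quants, so...
        default:
            GGML_ASSERT(false); // -> "GGML_ASSERT: ...dmmv.cu:804: false"
    }
}

int main() {
    dequantize_mul_mat_vec(GGML_TYPE_IQ1_S); // aborts, like loading the model
}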

Yuki7k commented 2 months ago

I got the same error, only with Llama-3-DARE-8B.IQ3_M.gguf

GGML_ASSERT: d:\a\koboldcpp\koboldcpp\ggml-cuda\dmmv.cu:804: false

It's my first time attempting to use one of these IQ GGUFs, so I guess it's related to that?

Kas1o commented 1 month ago

I think I found the reason: there are two kernels here, dmmv and mmvq, and which one is used is selected based on the CUDA device's compute capability.

(MMVQ for 6.1/Pascal/GTX 1000 or higher), per the llama.cpp README:

https://github.com/ggerganov/llama.cpp/blob/master/README.md

and the point is that dmmv does not support the importance-matrix (IQ) quant types.

By the way, the selection is based on the oldest GPU in your system, even if you select a newer GPU on the startup screen.
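
A minimal sketch of why the oldest GPU wins (paraphrased from how llama.cpp's CUDA backend of that era made the decision; g_device_count and g_device_caps are hypothetical stand-ins for its per-device bookkeeping, and 610 corresponds to compute capability 6.1):

#include <climits>

#define MIN_CC_DP4A 610 // dp4a int8 dot product: compute capability 6.1+ (Pascal)

struct cuda_device_caps { int cc; }; // compute capability, e.g. 610 for a GTX 1080

// hypothetical stand-ins for the backend's global device table
static int              g_device_count = 0;
static cuda_device_caps g_device_caps[16];

static bool use_mmvq_path() {
    // the decision uses the MINIMUM compute capability across all visible
    // devices, so one old card disables mmvq for the whole system no matter
    // which GPU you pick on the startup screen
    int min_cc = INT_MAX;
    for (int id = 0; id < g_device_count; ++id) {
        if (g_device_caps[id].cc < min_cc) {
            min_cc = g_device_caps[id].cc;
        }
    }
    return min_cc >= MIN_CC_DP4A; // below 6.1: fall back to dmmv (no IQ kernels)
}

So with any pre-Pascal GPU present, an IQ-quant model ends up on the dmmv path and hits the assert above.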