LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Unable to run models with the Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 formats on an ARM device. #1117

Closed gustrd closed 1 month ago

gustrd commented 2 months ago

Describe the Issue Upstream now has a new feature: ARM-optimized model formats (Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8). I tried to run every one of them on my Snapdragon 8 Gen 1, but I was unable to with koboldcpp.

Additional Information: Checking upstream I found the new documentation (https://github.com/ggerganov/llama.cpp/pull/9321), which shows that some flags must be set at compile time. Could you explain how to compile koboldcpp with those flags so I can try again?

To support `Q4_0_4_4`, you must build with `GGML_NO_LLAMAFILE=1` (`make`) or `-DGGML_LLAMAFILE=OFF` (`cmake`).
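For reference, the two upstream build variants described in the quoted llama.cpp docs would look roughly like this (build directory name is an assumption):

```shell
# make-based build of upstream llama.cpp with llamafile SGEMM disabled
make GGML_NO_LLAMAFILE=1

# or the cmake equivalent ("build" directory name chosen here for illustration)
cmake -B build -DGGML_LLAMAFILE=OFF
cmake --build build
```

Note that these flags apply to upstream llama.cpp; as the maintainer explains below, koboldcpp did not yet expose an equivalent flag at the time.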

LostRuins commented 2 months ago

At the moment there is no flag to remove llamafile; I will add one. For now, you need to remove every occurrence of `-DGGML_USE_LLAMAFILE` from the makefile, and then rebuild.
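A minimal sketch of that workaround, assuming you are in a koboldcpp checkout and the define appears in the top-level `Makefile`:

```shell
# Strip every occurrence of the llamafile define (keep a .bak backup),
# then rebuild from scratch so the ARM-optimized Q4_0_4_4 path is used.
sed -i.bak 's/-DGGML_USE_LLAMAFILE//g' Makefile
make clean && make
```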

Abhrant commented 1 month ago

Can we not just delete the llama.cpp folder, clone it again and run `make` again?

gustrd commented 1 month ago

With the latest version I was able to run Q4_0_4_4 just by compiling from source. Thanks!

Abhrant commented 1 month ago

@gustrd , which quantization exactly is Q4_0_4_4? What quantization config do you have to specify to run it? And how fast is it compared to other quantizations on ARM?

gustrd commented 1 month ago

@Abhrant , I'm not a specialist, but AFAIK Q4_0_4_4 is a special type of Q4 that takes advantage of ARM optimizations present on some newer devices.

Q4_0_4_8 uses i8mm and Q4_0_8_8 uses SVE, which are even newer technologies.
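On Linux/Android you can check which of these features your CPU advertises. This is a hedged sketch: the mapping assumed here is that `asimddp` (dotprod) backs Q4_0_4_4, `i8mm` backs Q4_0_4_8, and `sve` backs Q4_0_8_8.

```shell
# List the relevant aarch64 feature flags, if any, from /proc/cpuinfo.
# Prints nothing relevant -> the corresponding format likely won't be accelerated.
grep -oE 'asimddp|i8mm|sve' /proc/cpuinfo | sort -u \
  || echo "none of the relevant features advertised"
```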

I could only test Q4_0_4_4, and it got a great prompt-processing increase and a minor generation-speed increase.

With a Snapdragon 8 Gen 1 I'm getting around 35 t/s prompt processing and 9 t/s generation for a 3B model.