Closed gustrd closed 1 month ago
At the moment there is no flag to disable llamafile; I will add one. For now, you need to remove all occurrences of `-DGGML_USE_LLAMAFILE` from the Makefile and then rebuild.
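The manual edit above can be scripted. Here is a minimal sketch (assuming GNU sed), demonstrated on a scratch copy; for the real thing you would run the same `sed` against koboldcpp's `Makefile` and then `make clean && make`:

```shell
# Demo on a scratch file -- the real edit targets koboldcpp's Makefile.
printf 'CXXFLAGS += -O3 -DGGML_USE_LLAMAFILE\n' > /tmp/demo_makefile

# Strip every occurrence of the flag in place (GNU sed syntax).
sed -i 's/-DGGML_USE_LLAMAFILE//g' /tmp/demo_makefile

# The flag is gone; the rest of the line is untouched.
cat /tmp/demo_makefile
```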
Can we not just delete the llama.cpp folder, clone it again, and run `make` again?
With the last version I was able to run Q4_0_4_4 just by compiling from source. Thanks!
@gustrd , which quantization exactly is Q4_0_4_4 ? What quantization config do you have to specify to run this ? And fast is it compared to other quantizations on ARM ?
@Abhrant , I'm not a specialist about it, but AFAIK Q4_0_4_4 is a special type of Q4 that takes advantage from some arm optimizations, present at some newer devices.
Q4_0_4_8 uses i8mm and Q4_0_8_8 uses SVC, that are even newer technologies.
I could only test Q4_0_4_4, and it gave a large prompt-processing speedup and a minor generation-speed increase.
With a Snapdragon 8G1 I'm getting around 35 t/s prompt processing and 9 t/s generation for a 3B model.
Describe the Issue Upstream has a new feature: ARM-optimized models (Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8). I tried to run each of them on my Snapdragon 8G1, but none of them would run with koboldcpp.
Additional Information: Checking upstream, I saw the new documentation (https://github.com/ggerganov/llama.cpp/pull/9321), which shows that certain flags must be set at compile time. Could you explain how to compile koboldcpp with those flags so I can try again?
To support `Q4_0_4_4`, you must build with `GGML_NO_LLAMAFILE=1` (`make`) or `-DGGML_LLAMAFILE=OFF` (`cmake`).
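For reference, those two forms correspond to the two build systems upstream supports. A sketch of each invocation (exact targets and directories depend on your checkout):

```shell
# make build
make clean
make GGML_NO_LLAMAFILE=1

# or cmake build
cmake -B build -DGGML_LLAMAFILE=OFF
cmake --build build --config Release
```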