LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Quantkv CPU requirements? #1073

Open SimplyCorbett opened 2 months ago

SimplyCorbett commented 2 months ago

When using -quantkv 1 in addition to flash attention on my M2 Max Mac Studio, the CPU pegs at 100% while processing a long context. I'm not even sure whether it ever answers, because I closed koboldcpp and gave up after a couple of minutes.

Without -quantkv, the initial response takes around 60 seconds and subsequent replies (as long as it doesn't have to reprocess the whole context) take around 15-30 seconds.

Is quantkv CPU/GPU intensive, or did I just run into a bug?
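
For reference, a launch roughly like the one below matches the setup described above. The model path and context size are placeholders, and the flag names (--flashattention, --quantkv) are assumed from the current koboldcpp command line; quantkv is used together with flash attention as noted.

```
# Hypothetical model path and context size; flags mirror the setup in this report.
python koboldcpp.py --model /path/to/model.gguf \
    --contextsize 16384 \
    --flashattention \
    --quantkv 1
```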

LostRuins commented 2 months ago

Probably just not optimized for macOS. The fast implementation is for CUDA.