When I use -quantkv 1 together with flash attention on my M2 Max Mac Studio, the CPU pegs at 100% while processing a long context. I'm not even sure whether it ever answers, because I gave up and closed koboldcpp after a couple of minutes.
Without -quantkv, the initial response takes around 60 seconds, and subsequent replies (as long as it doesn't have to reprocess the whole context) take around 15-30 seconds.
Is quantkv CPU/GPU intensive, or did I just run into a bug?