h3ndrik closed this issue 1 year ago.
Can confirm, I have the same problem with a Radeon 5600 XT.
K-quants are not supported with CLBlast yet. Support will have to be added upstream first.
Rip, was looking forward to trying out the new k-quants too, but GPU offloading is a must for me.
https://github.com/ggerganov/llama.cpp/issues/1725
Looks like there's an issue thread with the same error but not much activity on it.
Yes, if anyone is able to write CL kernels, contributions (either here or preferably upstream) are greatly welcome; that would be a big help.
Would a CUBLAS version work while CLBlast isn't yet compatible with models using the new K quantizations? If so, I understand that it's a lot of maintenance, as you mentioned before, LostRuins, but could you (or anyone who knows how) make a tutorial so us noobs can prepare and compile a CUBLAS version of KoboldCpp from your source, with make or even cmake? I think there are a lot of folks with Nvidia cards around here, and the time required to handle long-context prompts with OpenBLAS is really too long to enjoy those new K-quant models (for example, a 33B becomes usable for me in Q3, while in Q4 it's really too slow).
I am currently working on the k-quant dequant kernels for CLBlast, although that may take some more time. For CUBLAS, if I have a chance to set up my Visual Studio with CUDA again, I'll try to do a new build.
Thank you very much, LostRuins! And thanks to you and all the contributors for the whole KoboldCpp project. I learned to compile from source just to use the latest experimental versions, and I now enjoy it in tandem with Silly Tavern and its extensions!
@LostRuins, would it be possible for you to share the instructions for compiling the project with CUBLAS? I could add them to the project documentation. I imagine it would be useful to many users.
Please try the latest version, 1.30.1, which has k-quant support in both CUDA and CLBlast.
@gustrd for CUBLAS, simply use the CMake file to build. It should be automatic.
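For anyone else who wants the exact steps, here is a rough sketch of that CMake build, assuming the CUDA toolkit is already installed. The LLAMA_CUBLAS option name is taken from upstream llama.cpp and may not even be necessary if the KoboldCpp CMake file already enables CUBLAS by default, as suggested above.

git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
mkdir build && cd build
cmake .. -DLLAMA_CUBLAS=ON    # assumed option; the CMake file may already default to CUBLAS
cmake --build . --config Release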
I tested the 1.30.2 releases of KoboldCpp (CLBlast and CUBLAS). Both work as intended on long prompt processing (1,000 tokens; blasbatchsize 512).
Config: Ryzen 5600 (65W) with 32GB DDR4-3600 CL16 and an Nvidia GeForce 980 Ti (RTX 3090 coming in a few days!)
Model: WizardLM "30b" Q2_K
Prompt processing of 1,000 tokens is twice as fast with CUBLAS as with CLBlast (30 seconds instead of 60, with 8 threads). Generation of the following tokens (8 at a time, using the old pseudo-streaming method) is equally fast on both (400 ms per 8 tokens). The new streaming method has a small display bug: the KoboldCpp chat window in the browser doesn't scroll down automatically, unlike its usual behavior with 8-token pseudo-streaming. Massive kudos and gratitude for both releases, notably the CUBLAS one.
Sometimes koboldcpp crashes when using --useclblast. Not using BLAS or only using OpenBLAS works fine. It only crashes when I add --useclblast 0 0 to the command line. I'm not sure if this has to do with the new quantization method.
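The kind of launch line that crashes for me looks roughly like this (the model path is just an example; as I understand it, the two numbers after --useclblast select the OpenCL platform and device):

python koboldcpp.py models/some-model.ggmlv3.q4_K_M.bin --useclblast 0 0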
Some models work fine, even when I use them with --useclblast. The first model I noticed this with was a ggmlv3 q4_K_M. But I also did a git pull and re-compiled koboldcpp, so I'm not sure what introduced the bug. I'm running the latest git version (6635f7efce3389a0b15d3a01cdc85c4e65c8bccc) on Debian GNU/Linux. Compiled with
make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1
Everything worked flawlessly for a long time. Here is the cli output of a crash:
$ uname -a
Linux pc 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
python 3.9.2
A few libraries i installed: libblas-dev, libclblas-dev, libopenblas-dev, libmkl-intel-thread
(Edit: I'm probably not going to use CLBlast with an old Intel iGPU anyway, so feel free to close this issue if you don't want to fix this. It seems using CLBlast on this Intel hardware makes everything slower, not faster.)