LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with a KoboldAI UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Crash with --useclblast (GGML_ASSERT: ggml-opencl.cpp:1019: to_fp32_cl != nullptr) #222

Closed h3ndrik closed 1 year ago

h3ndrik commented 1 year ago

Sometimes koboldcpp crashes when using --useclblast

Not using BLAS or only using OpenBLAS works fine. It only crashes when I add --useclblast 0 0 to the command line. I'm not sure if this has to do with the new quantization method. Some models work fine, even when I use them with --useclblast. The first model I noticed this with was a ggmlv3 q4_K_M. But I also did a git pull and re-compiled koboldcpp, so I'm not sure what introduced the bug.

I'm running the latest git version (6635f7efce3389a0b15d3a01cdc85c4e65c8bccc) on Debian GNU/Linux, compiled with make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1. Everything worked flawlessly for a long time.

Here is the CLI output of a crash:

```
h3ndrik@pc:~/tmp/koboldcpp$ python3 koboldcpp.py --threads 2 --nommap --useclblast 0 0 models/nous-hermes-13b.ggmlv3.q4_K_M.bin
Welcome to KoboldCpp - Version 1.29                                                                                   
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.                 
Initializing dynamic library: koboldcpp_clblast.so                                                                    
==========                                                                                                            
Loading model: /home/h3ndrik/tmp/koboldcpp/models/nous-hermes-13b.ggmlv3.q4_K_M.bin                                      
[Threads: 2, BlasThreads: 2, SmartContext: False]                                                                     

---                                                                                                                   
Identified as LLAMA model: (ver 5)                                                                                    
Attempting to Load...                                                                                                 
---                                                                                                                   
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | 
F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |                                              
llama.cpp: loading model from /home/h3ndrik/tmp/koboldcpp/models/nous-hermes-13b.ggmlv3.q4_K_M.bin                       
llama_model_load_internal: format     = ggjt v3 (latest)                                                              
llama_model_load_internal: n_vocab    = 32001                                                                         
llama_model_load_internal: n_ctx      = 2048                                                                          
llama_model_load_internal: n_embd     = 5120                                                                          
llama_model_load_internal: n_mult     = 256                                                                           
llama_model_load_internal: n_head     = 40                                                                            
llama_model_load_internal: n_layer    = 40                                                                            
llama_model_load_internal: n_rot      = 128                                                                           
llama_model_load_internal: ftype      = 15 (mostly Q4_K - Medium)                                                     
llama_model_load_internal: n_ff       = 13824                                                                         
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 7460.66 MB

Platform:0 Device:0  - Intel(R) OpenCL HD Graphics with Intel(R) Graphics Gen9 [0x191d]

ggml_opencl: selecting platform: 'Intel(R) OpenCL HD Graphics'
ggml_opencl: selecting device: 'Intel(R) Graphics Gen9 [0x191d]'
ggml_opencl: device FP16 support: true
CL FP16 temporarily disabled pending further optimization.
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required  = 9508.66 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 0 layers to GPU
llama_model_load_internal: total VRAM used: 0 MB
....................................................................................................
llama_init_from_file: kv self size  = 1600.00 MB
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
172.16.33.42 - - [08/Jun/2023 18:30:31] "GET / HTTP/1.1" 200 -
172.16.33.42 - - [08/Jun/2023 18:30:31] "GET /api/v1/model HTTP/1.1" 200 -
172.16.33.42 - - [08/Jun/2023 18:30:31] "GET /api/v1/info/version HTTP/1.1" 200 -

Input: {"n": 1, "max_context_length": 1024, "max_length": 80, "rep_pen": 1.1, "temperature": 0.7, "top_p": 0.5, "top_k
": 0, "top_a": 0.75, "typical": 0.19, "tfs": 0.97, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [6, 5
, 4, 3, 2, 1, 0], "prompt": "### Instruction:\n[Redacted]\n### Response:", "quiet": true}

Processing Prompt [BLAS] (276 / 276 tokens)GGML_ASSERT: ggml-opencl.cpp:1019: to_fp32_cl != nullptr
Aborted
```

```
$ uname -a
Linux pc 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1 (2023-05-08) x86_64 GNU/Linux
```

Python 3.9.2

A few libraries I installed: libblas-dev, libclblas-dev, libopenblas-dev, libmkl-intel-thread

(Edit: I'm probably not gaining anything from CLBlast with an old Intel iGPU anyway, so feel free to close this issue if you don't want to fix this. It seems using Intel CLBlast on this hardware makes everything slower, not faster.)

KizzyCode commented 1 year ago

Can confirm, I have the same problem with a Radeon 5600 XT:

```
(.venv) keziah@kizzys-tumbleweed:~/Documents/KoboldCpp> python koboldcpp.py --threads 12 --useclblast 0 0 --gpulayers 15 --port 8080 --model ~/Documents/Wizard-Vicuna- --skiplauncher
Wizard-Vicuna-13B-Uncensored.ggmlv3.q6_K.bin  Wizard-Vicuna-30B-Uncensored.ggmlv3.q4_K_S.bin
(.venv) keziah@kizzys-tumbleweed:~/Documents/KoboldCpp> python koboldcpp.py --threads 12 --useclblast 0 0 --gpulayers 15 --port 8080 --model ~/Documents/Wizard-Vicuna-13B-Uncensored.ggmlv3.q6_K.bin --skiplauncher
Welcome to KoboldCpp - Version 1.29
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.so
==========
Loading model: /home/keziah/Documents/Wizard-Vicuna-13B-Uncensored.ggmlv3.q6_K.bin
[Threads: 12, BlasThreads: 12, SmartContext: False]

---
Identified as LLAMA model: (ver 5)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from /home/keziah/Documents/Wizard-Vicuna-13B-Uncensored.ggmlv3.q6_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 18 (mostly Q6_K)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB

Platform:0 Device:0  - AMD Accelerated Parallel Processing with gfx1010:xnack-

ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1010:xnack-'
ggml_opencl: device FP16 support: true
CL FP16 temporarily disabled pending further optimization.
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required  = 8509.05 MB (+ 1608.00 MB per state)
llama_model_load_internal: offloading 15 layers to GPU
llama_model_load_internal: total VRAM used: 3723 MB
......................................
llama_init_from_file: kv self size  = 1600.00 MB
GGML_ASSERT: ggml-opencl.cpp:1019: to_fp32_cl != nullptr
Aborted (core dumped)
```

LostRuins commented 1 year ago

K-quants are not supported with clblast yet. Support will have to be added upstream first.
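
For context on why it surfaces as this particular assertion: before running the BLAS matmul, the OpenCL backend looks up a dequantize-to-fp32 kernel for the tensor's quantization type, and no kernel was registered yet for the k-quant formats, so the lookup comes back empty and the assert at ggml-opencl.cpp:1019 aborts. A minimal sketch of that failure mode (illustrative names only, not the actual ggml-opencl.cpp symbols):

```cpp
#include <cassert>

// Illustrative quantization tags -- not the real GGML_TYPE_* values.
enum class QuantType { Q4_0, Q5_0, Q8_0, Q4_K, Q6_K };

using to_fp32_fn = void (*)(const void *src, float *dst, int n);

static void dequant_q4_0(const void *, float *, int) {
    // body omitted -- only the dispatch matters for this issue
}

// The backend picks a dequantization kernel by quantization type; anything it
// does not know about falls through to nullptr.
static to_fp32_fn get_to_fp32(QuantType t) {
    switch (t) {
        case QuantType::Q4_0: return dequant_q4_0;
        // ... other pre-k-quant formats ...
        default:              return nullptr;  // k-quants (Q4_K, Q6_K, ...) had no CL kernel yet
    }
}

int main() {
    to_fp32_fn to_fp32 = get_to_fp32(QuantType::Q4_K);  // a k-quant model takes this path
    assert(to_fp32 != nullptr);  // analogue of the failing GGML_ASSERT; running this aborts
    return 0;
}
```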

DocShotgun commented 1 year ago

Rip, was looking forward to trying out the new k-quants too, but GPU offloading is a must for me.

https://github.com/ggerganov/llama.cpp/issues/1725

Looks like there's an issue thread with the same error but not much activity on it.

LostRuins commented 1 year ago

Yes, if anyone is able to write CL kernels, contributions (either here or preferably upstream) are greatly welcome; that would be a big help.
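
For anyone considering it, here is a rough sketch of the general shape such a kernel takes, written for a simple Q4_0-style layout (one fp16 scale plus 32 packed 4-bit weights per block). This is illustrative only, not koboldcpp's actual kernel; the k-quant formats are the harder part because they use 256-weight super-blocks with per-sub-block scales (and, for some formats, minimums):

```c
// Requires fp16 storage support (the logs above report "device FP16 support: true").
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Hypothetical 4-bit block: one scale plus 32 packed nibbles (Q4_0-style layout).
typedef struct __attribute__((packed)) {
    half  d;        // per-block scale
    uchar qs[16];   // 32 x 4-bit quantized weights, two per byte
} block_q4;

// One work-item per output weight: y[i] = (q[i] - 8) * d
__kernel void dequantize_block_q4(__global const block_q4 *x, __global float *y) {
    const uint i  = get_global_id(0);
    const uint ib = i / 32;             // which block
    const uint il = i % 32;             // position inside the block

    const float d    = vload_half(0, &x[ib].d);
    const uchar byte = x[ib].qs[il % 16];
    // first 16 weights come from the low nibbles, the last 16 from the high nibbles
    const int q = (il < 16) ? (byte & 0xF) : (byte >> 4);

    y[i] = (q - 8) * d;                 // re-centre around zero and scale
}
```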

Nexesenex commented 1 year ago

Would a CUBLAS version work while CLBlast isn't yet compatible with models using the new K quantizations? If yes, I understand it's a lot of maintenance, as you mentioned before, LostRuins, but could you, or anyone who knows how, make a tutorial so us noobs can prepare and compile a CUBLAS version of KoboldCPP from your source ourselves, with make or even cmake? I think there are a lot of folks with Nvidia cards around here, and the time required to handle long context prompts with OpenBLAS is really too long to enjoy those new K-quant models (for example, 33b becomes usable for me in Q3, while in Q4 it's really too slow).

LostRuins commented 1 year ago

I am currently working on the k-quant dequant kernels for CLBlast, although that may take some more time. For CUBLAS, if I have a chance to set up my Visual Studio with CUDA again, I'll try to do a new build.

Nexesenex commented 1 year ago

Thank you very much, LostRuins, and thanks to you and all the contributors for the whole KoboldCpp project. I learned to compile from source just to use the latest experimental versions, and I now enjoy it in tandem with Silly Tavern and its extensions!

gustrd commented 1 year ago

> I am currently working on the k-quant dequant kernels for clblast although that may take some more time. For cublas if I have a chance to setup my visual studio with cuda again I'll try to do a new build.

@LostRuins, would it be possible for you to share the instructions for compiling the project with CUBLAS? I could add them to the project documentation. I imagine it would be useful to many users.

LostRuins commented 1 year ago

Please try the latest version, 1.30.1, which has k-quant support in both CUDA and CLBlast.

@gustrd for CUBLAS, simply use the CMake file to build. It should be automatic.
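
For anyone unfamiliar with CMake, a typical out-of-source build from the repository root looks roughly like this (a sketch, not official instructions; the CUDA toolkit must already be installed for the CUBLAS library to be produced):

```
mkdir build && cd build
cmake ..                          # per the comment above, CUDA is picked up automatically
cmake --build . --config Release
```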

Nexesenex commented 1 year ago

I tested the 1.30.2 releases of KoboldCPP (CLBLAST and CUBLAS). Both work as intended on long prompt processing (1000 tokens; blasbatchsize 512).

Config: Ryzen 5600 (65W) with 32GB DDR4-3600 CL16 and an Nvidia GeForce 980 Ti (RTX 3090 coming in a few days!)

Model: WizardLM "30b" Q2_K

I get 1000-token prompt processing twice as fast with CUBLAS as with CLBLAST (30 seconds instead of 60, with 8 threads). Subsequent token generation (8 by 8; I use the old pseudo-streaming method) is equally fast on both (400 ms per 8 tokens). The new streaming method has a little display "bug": the chat window of KoboldCpp in the browser doesn't scroll down automatically, unlike its usual behavior with 8-token pseudo-streaming. Massive kudos and gratitude for both releases, notably the CUBLAS one.