LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

CLBlast --gpulayers causes "not enough space in the buffer" error #825

Closed BritishTeapot closed 4 months ago

BritishTeapot commented 5 months ago

KoboldCpp 1.64 (concedo)

Hardware:

iMac 21.5" (2017)
CPU: Intel(R) Core(TM) i5-7400 @ 3.00GHz
GPU: AMD Radeon Pro 555 Compute Engine (2GB VRAM)
RAM: 32GB

Issue:

Using --useclblast and --gpulayers together always results in a "not enough space in the buffer" error. It persists with different models and with any layer count.

Example output:

python3 ./koboldcpp.py --useclblast 0 1 --gpulayers 1 --benchmark --model ../models/Phi-3-mini-128k-instruct.Q5_K_M.gguf
***
Welcome to KoboldCpp - Version 1.64.1
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.so
==========
Namespace(benchmark='stdout', blasbatchsize=512, blasthreads=2, chatcompletionsadapter='', config=None, contextsize=2048, debugmode=0, flashattention=False, forceversion=0, foreground=False, gpulayers=1, highpriority=False, hordeconfig=None, host='', ignoremissing=False, launch=False, lora=None, mmproj='', model='../models/Phi-3-mini-128k-instruct.Q5_K_M.gguf', model_param='../models/Phi-3-mini-128k-instruct.Q5_K_M.gguf', multiuser=0, noavx2=False, noblas=False, nocertify=False, nommap=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory='', quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdconfig=None, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=2, useclblast=[0, 1], usecublas=None, usemlock=False, usevulkan=None)
==========
Loading model: /Users/*****/koboldcpp/models/Phi-3-mini-128k-instruct.Q5_K_M.gguf 
[Threads: 2, BlasThreads: 2, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: phi3

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 

Platform:0 Device:0  - Apple with Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz
Platform:0 Device:1  - Apple with AMD Radeon Pro 555 Compute Engine

ggml_opencl: selecting platform: 'Apple'
ggml_opencl: selecting device: 'AMD Radeon Pro 555 Compute Engine'
llama_model_loader: loaded meta data with 23 key-value pairs and 195 tensors from /Users/*****/koboldcpp/models/Phi-3-mini-128k-instruct.Q5_K_M.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = unknown, may not work (guessed)
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.62 GiB (5.89 BPW) 
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_tensors: ggml ctx size =    0.26 MiB
ggml_tallocr_alloc: not enough space in the buffer to allocate blk.31.ffn_up.weight (needed 34603008, available 34574336)
GGML_ASSERT: ggml-alloc.c:94: !"not enough space in the buffer"
[1]    8345 abort      python3 ./koboldcpp.py --useclblast 0 1 --gpulayers 1 --benchmark --model
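
Worth noting: the failing allocation is only barely too large for the buffer. A quick arithmetic check on the numbers reported by ggml_tallocr_alloc (a sketch using nothing beyond the figures in the log above):

```python
# Figures taken verbatim from the ggml_tallocr_alloc error above.
needed = 34_603_008      # bytes requested for blk.31.ffn_up.weight
available = 34_574_336   # bytes left in the buffer

print(f"needed:    {needed / 1024**2:.2f} MiB")              # 33.00 MiB
print(f"available: {available / 1024**2:.2f} MiB")           # 32.97 MiB
print(f"shortfall: {(needed - available) / 1024:.0f} KiB")   # 28 KiB
```

A ~28 KiB shortfall on a 33 MiB tensor looks more like the buffer size estimate being slightly off (padding or alignment, perhaps) than like the 2GB card actually running out of memory.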
LostRuins commented 5 months ago

I'm not sure, but at first glance your GPU only has 2GB of VRAM, which isn't really enough. I don't think you'd get much benefit from offloading even if it worked.
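
For a rough sense of scale, here is a sketch using only values that appear in the log (model size 2.62 GiB, n_layer = 32, n_embd_k_gqa = n_embd_v_gqa = 3072, contextsize = 2048); the f16 KV-cache size is an assumption, since the log doesn't state the cache type:

```python
# Rough per-layer VRAM estimate from values printed in the log above.
# Assumption: KV cache entries are f16 (2 bytes each); the log does not say.
model_size_gib = 2.62     # llm_load_print_meta: model size
n_layer = 32              # llm_load_print_meta: n_layer
n_ctx = 2048              # contextsize from the launch namespace
n_embd_kv = 3072          # n_embd_k_gqa == n_embd_v_gqa
bytes_per_elem = 2        # assumed f16

weights_per_layer_mib = model_size_gib * 1024 / n_layer               # ~84 MiB
kv_per_layer_mib = n_ctx * 2 * n_embd_kv * bytes_per_elem / 1024**2   # ~24 MiB

print(f"avg weights per layer: ~{weights_per_layer_mib:.0f} MiB")
print(f"KV cache per layer:    ~{kv_per_layer_mib:.0f} MiB")
```

So each offloaded layer is on the order of 100 MiB for this model; a few layers would fit in 2GB, but offloading only a few of 32 layers gives a limited speedup either way.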

gustrd commented 4 months ago

I can't say for certain whether this will help, but on my older laptop with an Nvidia MX graphics card and 2GB of VRAM, I was able to use CLBlast effectively while offloading some layers (though I've since switched to CUDA, which has proven to be the better option).

It's possible that the problem you're hitting is specific to AMD hardware, but it might be worth investigating upstream (llama.cpp).

By the way, if CLBlast already speeds up your prompt processing, I believe the additional gain from offloading layers will be marginal.
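
To put a rough number on "marginal", here is a sketch assuming per-token generation time scales with the number of layers still run on the CPU:

```python
# Upper bound on the generation-speed gain from offloading a few layers,
# assuming per-token time is proportional to the layers left on the CPU
# (and ignoring any CPU<->GPU transfer overhead).
n_layer = 32
for offloaded in (1, 4, 8):
    speedup = n_layer / (n_layer - offloaded)
    print(f"{offloaded:2d} of {n_layer} layers offloaded -> at most ~{(speedup - 1) * 100:.0f}% faster")
```

In practice, CPU-GPU transfer overhead can eat into even that margin.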

BritishTeapot commented 4 months ago

Just pulled the latest changes and recompiled, and the issue is gone. It also did make things noticeably faster, but only for prompt processing; generation speed actually got slower :D. Closing the issue.