LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Just downloaded the latest release (1.57): "Not enough space in the buffer" #672

Open Sazu-bit opened 8 months ago

Sazu-bit commented 8 months ago
***
Welcome to KoboldCpp - Version 1.57.1
Loading kcpps configuration file...
Attempting to use CLBlast library for faster prompt ingestion. A compatible clblast will be required.
Initializing dynamic library: koboldcpp_clblast.so
==========
Namespace(bantokens=None, benchmark=None, blasbatchsize=512, blasthreads=5, config=None, contextsize=4096, debugmode=0, forceversion=0, foreground=False, gpulayers=38, highpriority=False, hordeconfig=None, host='', launch=False, lora=None, model=None, model_param='/data/AI/llm/mythomax-l2-13b.Q5_K_M.gguf', multiuser=1, noavx2=False, noblas=False, nommap=False, noshift=False, onready='', port=5001, port_param=5001, preloadstory=None, quiet=False, remotetunnel=False, ropeconfig=[0, 10000.0], skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=5, useclblast=[0, 0], usecublas=None, usemlock=False, usevulkan=None)
==========
Loading model: /data/AI/llm/mythomax-l2-13b.Q5_K_M.gguf 
[Threads: 5, BlasThreads: 5, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 

Platform:0 Device:0  - Clover with AMD Radeon RX 590 Series (radeonsi, polaris10, LLVM 16.0.6, DRM 3.42, 5.15.145-1-MANJARO)

ggml_opencl: selecting platform: 'Clover'
ggml_opencl: selecting device: 'AMD Radeon RX 590 Series (radeonsi, polaris10, LLVM 16.0.6, DRM 3.42, 5.15.145-1-MANJARO)'
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /data/AI/llm/mythomax-l2-13b.Q5_K_M.gguf (version GGUF V2)
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 8.60 GiB (5.67 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.28 MiB
ggml_tallocr_alloc: not enough space in the buffer to allocate blk.11.ffn_norm.weight (needed 32768, largest block available 4096)
GGML_ASSERT: ggml-alloc.c:114: !"not enough space in the buffer"
ptrace: Operation not permitted.
No stack.
The program is not being run.
[1]+  Done                    gnome-terminal -- koboldcpp --config /data/AI/ai-textgen/settings_mytho.kcpps
Aborted (core dumped)

For what it's worth, I have an i7 processor, 16GB of DDR3 RAM, and 8GB of VRAM. I can run this model just fine with https://aur.archlinux.org/packages/koboldcpp-clblast, but unfortunately that package is still on 1.53.

I am currently using 4.10G out of 15.6G, so I trust I have enough space in RAM (normally, when this model is running, I'm using around 95% of my VRAM plus an additional 5 or 6GB of RAM). The only reason I was updating is that I plan to use mlock: this thing keeps shifting to swap, and my swap is really, really slow. It does this automatically despite having an additional 4GB to play with, hence the mlock.
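
For reference, the invocation I have in mind is roughly the following (the `--usemlock` flag is the one visible in the Namespace dump above; as far as I understand, `mlock()` is capped by `RLIMIT_MEMLOCK`, so the locked-memory limit may need raising first):

```
# raise the locked-memory limit (may need root or a limits.conf entry),
# then pass --usemlock on top of the existing config:
ulimit -l unlimited
koboldcpp --config /data/AI/ai-textgen/settings_mytho.kcpps --usemlock
```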

I've switched back to 1.53 temporarily, but I can still test with 1.57 if further guidance is needed. I'm not sure how to run ptrace, so I would need some assistance with that.
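
If a backtrace is what's needed, would something like this be the right approach? (Assuming `gdb` is installed; the "ptrace: Operation not permitted" above looks like a blocked debugger attach, which the `kernel.yama.ptrace_scope` sysctl controls.)

```
# run under gdb from the start, so no ptrace attach is needed
# (if koboldcpp here is a wrapper script, gdb may need to target the
# underlying python interpreter instead):
gdb --args koboldcpp --config /data/AI/ai-textgen/settings_mytho.kcpps
# then, inside gdb:
#   run
#   bt    # after the GGML_ASSERT abort, to print the stack
```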

YellowRoseCx commented 8 months ago

When you see that it's using 4.10GiB out of 15.6GiB RAM, does that include cache RAM usage? The only reason a system would start using swap is if the whole of the system RAM is depleted. The mlock parameter is in 1.53 as well.
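
One quick way to check is `free -h`; the `available` column accounts for reclaimable cache, while the plain `used` figure does not:

```
free -h    # compare "available" (cache-aware) against plain "used"
```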

Have you tried seeing if it loads with fewer GPU layers?

Sazu-bit commented 8 months ago

I thought I was clear, but apparently not. The 4.10G is my usual RAM load with no LLM model loaded; I have 16GB in total. When I load a model such as MythoMax or Airoboros, usage goes up to 8.53GB (after I post the first message; before that it's 14.8GB, and I have no idea why this happens, because it basically decreases and any swap used disappears). I have 8GB of VRAM, so that's where all the extra expected memory is going. I'm using CLBlast since I can't use ROCm: ROCm installs into /opt and is huge (13GB), and I don't have enough space on my root partition (using 22/30GB) to accommodate it, otherwise I'd be using hipBLAS. This is true for both 1.53 and 1.57.

As for the plan to use mlock: I am aware it's available in 1.53, but because this is a new activity I wanted to make sure I was running the latest version first, and I can't get the latest version to run.

I currently run quite happily on 1.53 with 38 layers, and I don't see why I can't run the same thing on 1.57, with or without mlock enabled; I get the "not enough space" error either way. It looks like it's tied to the context: the error message suggests it needs 32768, but only 4096 is available. I have tried setting the context to 32768, but it didn't make a difference.

Yes, I've tried loading the model in 1.57 with fewer layers, but I get the same error message.
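
For concreteness, the kinds of overrides I've been testing look like this (flag names are the ones shown in the Namespace dump at the top; the exact values varied between runs):

```
koboldcpp --config /data/AI/ai-textgen/settings_mytho.kcpps --gpulayers 20
koboldcpp --config /data/AI/ai-textgen/settings_mytho.kcpps --contextsize 32768
```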