LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

KoboldCPP crashes after Arch system update when loading GGUF model: ggml_cuda_host_malloc ... invalid argument #1158

Open YajuShinki opened 1 month ago

YajuShinki commented 1 month ago

Describe the Issue
After updating my computer, KoboldCPP either crashes or refuses to generate any text. Most of the time, when loading a model, the terminal shows the error ggml_cuda_host_malloc: failed to allocate 6558.12 MiB of pinned memory: invalid argument before it tries to load the model into memory. Occasionally it boots up successfully, but prompt processing is much slower than before the system update, and it aborts before actually generating anything. Eventually it simply crashes, printing Killed to the console before exiting. I've tried updating to the latest version of KoboldCPP, and both the cuda1210 and cuda1150 builds produce the same result.

Additional Information:
- OS: Arch Linux, kernel 6.11.3-arch1-1 (previous working version: 6.10)
- CPU: AMD Ryzen 5 5600 (12) @ 4.468GHz
- GPU: NVIDIA GeForce RTX 3060
- Model used: Beyonder 4x7b-v2 Q5_K_M
- GPU layers: 19
- CPU threads: 6
- Context size: 8192 with ContextShift on
- Crashes whether FlashAttention is off or on
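
For context, the failing call is a pinned (page-locked) host allocation. A minimal standalone repro, assuming ggml_cuda_host_malloc boils down to a cudaMallocHost-style call (the exact internals may differ), with the size taken from the log below:

```c
// Minimal repro sketch (assumption: ggml_cuda_host_malloc wraps a
// cudaMallocHost-style pinned allocation; "invalid argument" maps to
// cudaErrorInvalidValue). Build with: nvcc repro.cu -o repro
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    size_t bytes = (size_t)(6558.12 * 1024.0 * 1024.0); /* 6558.12 MiB, as in the log */
    void *ptr = NULL;
    cudaError_t err = cudaMallocHost(&ptr, bytes);      /* request pinned host memory */
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("allocated %zu bytes of pinned memory\n", bytes);
    cudaFreeHost(ptr);
    return 0;
}
```

If this fails the same way outside KoboldCPP, the problem likely lies in the CUDA driver / kernel combination rather than in KoboldCPP itself.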

Log:

***
Welcome to KoboldCpp - Version 1.76
For command line arguments, please refer to --help
***
Auto Selected CUDA Backend...

Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.so
==========
Namespace(benchmark=None, blasbatchsize=512, blasthreads=6, chatcompletionsadapter=None, config=None, contextsize=8192, debugmode=1, flashattention=False, forceversion=0, foreground=False, gpulayers=19, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model='', model_param='/home/yaju/AI/koboldcpp-new/models/beyonder-4x7b-v2.Q5_K_M.gguf', multiuser=1, noavx2=False, noblas=False, nocertify=False, nommap=False, nomodel=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdclamped=0, sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdquant=False, sdthreads=5, sdvae='', sdvaeauto=False, showgui=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=6, unpack='', useclblast=None, usecpu=False, usecublas=['normal', '0', 'mmq'], usemlock=False, usevulkan=None, whispermodel='')
==========
Loading model: /home/yaju/AI/koboldcpp-new/models/beyonder-4x7b-v2.Q5_K_M.gguf

The reported GGUF Arch is: llama
Arch Category: 4

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Trained max context length (value:2048).
Desired context length (value:8192).
Solar context multiplier (value:1.000).
Chi context train (value:325.950).
Chi chosen context (value:1303.798).
Log Chi context train (value:2.513).
Log Chi chosen context (value:3.115).
RoPE Frequency Base value (value:10000.000).
RoPE base calculated via Gradient AI formula. (value:90835.4).
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 26 key-value pairs and 611 tensors from /home/yaju/AI/koboldcpp-new/models/beyonder-4x7b-v2.Q5_K_M.gguf (version GGUF V3 (latest))
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 4
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 24.15 B
llm_load_print_meta: model size       = 15.49 GiB (5.51 BPW) 
llm_load_print_meta: general.name     = mlabonne_beyonder-4x7b-v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 1 '<s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.58 MiB
ggml_cuda_host_malloc: failed to allocate 6558.12 MiB of pinned memory: invalid argument
llm_load_tensors: offloading 19 repeating layers to GPU
llm_load_tensors: offloaded 19/33 layers to GPU
llm_load_tensors:        CPU buffer size =  6558.12 MiB
llm_load_tensors:      CUDA0 buffer size =  9304.88 MiB
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 8288
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   420.88 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   615.12 MiB
llama_new_context_with_model: KV self size  = 1036.00 MiB, K (f16):  518.00 MiB, V (f16):  518.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   593.56 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.19 MiB
llama_new_context_with_model: graph nodes  = 1510
llama_new_context_with_model: graph splits = 160
Killed
LostRuins commented 1 month ago

Did you select the number of layers yourself, or was it automatically picked?

YajuShinki commented 1 month ago

I chose the number of layers through trial and error. 19 layers was the maximum I could fit on the GPU with 8k context without it running out of VRAM.

LostRuins commented 1 month ago

Try fewer layers.

YajuShinki commented 1 month ago

I have tried running it again with 10 layers, and the result is still the same. The only difference is that it says failed to allocate 10965.24 MiB of pinned memory rather than 6558.12 (which I just now realized is the exact size of the CPU buffer), so something seems to be going very wrong when trying to allocate CPU RAM.
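
(For reference, that log pattern, a pinned-allocation warning followed by a plain CPU buffer of the same size, suggests a fall-back path along these lines. This is an illustrative sketch only, not the actual ggml code, and the names are hypothetical:)

```c
// Illustrative "try pinned, fall back to plain malloc" allocator.
// Hypothetical names; the real ggml implementation differs, but the log
// (warning followed by a same-sized CPU buffer) is consistent with this behaviour.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

static void *host_malloc_prefer_pinned(size_t size) {
    void *ptr = NULL;
    cudaError_t err = cudaMallocHost(&ptr, size);   /* try pinned memory first */
    if (err != cudaSuccess) {
        fprintf(stderr, "failed to allocate %.2f MiB of pinned memory: %s\n",
                size / (1024.0 * 1024.0), cudaGetErrorString(err));
        return malloc(size);                        /* fall back to unpinned RAM */
    }
    return ptr;
}
```

So the CPU buffer is still allocated, just unpinned; the later Killed typically means the kernel's OOM killer terminated the process, which points at overall RAM/swap pressure rather than the pinned-memory warning alone.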

justme1135 commented 3 weeks ago

Similar error on EndeavourOS with 6.11.4-arch2-1 kernel (existed in previous version as well).

ggml_cuda_host_malloc: failed to allocate 21588.00 MiB of pinned memory: invalid argument
LostRuins commented 3 weeks ago

Try using the default settings, don't change anything. Just launch koboldcpp, select your model, select CUDA, and disable MMAP. Does that work and load correctly?
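
(For anyone reproducing this from the command line, that corresponds roughly to a launch like the one below. The flags are taken from the Namespace dump in the log above; the model path is only a placeholder:)

```
python koboldcpp.py --model /path/to/model.gguf --usecublas --nommap
```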

YajuShinki commented 2 weeks ago

I just tried running KoboldCPP with all of the default settings (4096 context, auto-set GPU layers, etc.) with the only change being MMAP disabled. Shortly after it tried to load the model into memory, my computer became completely unresponsive and I had to force restart it.

LostRuins commented 2 weeks ago

I suspect that the model you're trying to use is just too big for your PC's memory. Perhaps try a smaller 8B model like Stheno.

AliveDedSec commented 1 week ago

> I suspect that the model you're trying to use is just too big for your PC's memory. Perhaps try a smaller 8B model like Stheno.

No, I can assure you this problem definitely exists. On my Manjaro Linux system, when the disable MMAP option is enabled, a large-scale memory leak occurs: RAM fills up almost instantly (far faster than the model could actually be loading), and then the page file fills up until the system freezes completely. After KoboldCPP exits, the RAM is not released; only a reboot frees the memory. This happens with any GGUF model, and with the same settings files that worked fine before. If I do not enable the disable MMAP option, everything works, somewhat slowly, but it works. I suspect this is caused by system updates to newer versions of the software; rolling back to previous versions of KoboldCPP, even very old ones, does not solve the problem. The system itself works perfectly and all packages are intact. Thank you for your hard work!

My kernel is 6.11.0-1-rt7-MANJARO, NVIDIA driver 550.120.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

LostRuins commented 1 week ago

Hmm, how about trying an older version of your nvidia driver then? Especially if it causes issues in older KoboldCpp versions, could be a driver issue.

AliveDedSec commented 1 week ago

> Hmm, how about trying an older version of your nvidia driver then? Especially if it causes issues in older KoboldCpp versions, could be a driver issue.

Dear LostRuins, I ran a series of loading tests with the same model (Mistral-Nemo-Instruct-2407-abliterated.i1-Q4_K_M.gguf) and different settings, and here is what I found. With the disable MMAP option enabled, models load correctly with all backends except CUDA; the memory overflow problem only occurs with CUDA acceleration. During loading, the error ggml_cuda_host_malloc: failed to allocate 5525.06 MiB of pinned memory: invalid argument appears. I am attaching the full log of this run: CUBLAS LOG1.TXT

I am also attaching the full log of a successful load of the same model with the same parameters but with CLBlast acceleration: CLBLSAST LOG2.TXT

With CuBLAS, all RAM is consumed and the entire page file fills up; only a reboot clears the memory. With the other modes, including CPU only, RAM usage stays within normal limits. I don't see any point in installing a different version of the NVIDIA driver. I suspect that CuBLAS acceleration in llama.cpp is simply not yet compatible with the newer versions of CUDA, the driver, the kernel, or some other new software.

AliveDedSec commented 1 week ago

I'm happy to give LostRuins full remote access to my machine, through a remote access program compatible with Manjaro Linux, if that helps make KoboldCPP better. The only caveat is that my system's interface is in Russian, but I am ready to switch it to English if necessary.

AliveDedSec commented 1 week ago

By the way, exactly the same problems exist in the latest version of https://github.com/ggerganov/llama.cpp, compiled from scratch, and even in https://github.com/oobabooga/text-generation-webui (oobabooga running in an isolated miniconda environment).

LostRuins commented 1 week ago

Remote access is not necessary.

If you don't want to swap drivers, perhaps you can try setting nommap to false instead?

Are you using the cu1210 version or cu1150 version?

AliveDedSec commented 1 week ago

> Remote access is not necessary.
>
> If you don't want to swap drivers, perhaps you can try setting nommap to false instead?
>
> Are you using the cu1210 version or cu1150 version?

Yes, that's what I do now: I don't enable nommap. I'm using this version of CUDA at the moment (in Pamac it is designated as 12.6.-1-1):

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0