LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

KoboldCPP crashes after Arch system update when loading GGUF model: ggml_cuda_host_malloc ... invalid argument #1158

Open YajuShinki opened 1 month ago

YajuShinki commented 1 month ago

Describe the Issue
After updating my computer, KoboldCPP either crashes or refuses to generate any text. Most of the time, when loading a model, the terminal shows the error ggml_cuda_host_malloc: failed to allocate 6558.12 MiB of pinned memory: invalid argument before it tries to load the model into memory. Occasionally it boots up successfully, but prompt processing is much slower than before the system update, and it aborts before actually generating anything. Eventually it simply crashes, printing Killed to the console before exiting. I've tried updating to the latest version of KoboldCPP, and both the cuda1210 and cuda1150 builds produce the same result.

Additional Information:
- OS: Arch Linux, kernel 6.11.3-arch1-1 (previous working version: 6.10)
- CPU: AMD Ryzen 5 5600 (12) @ 4.468GHz
- GPU: NVIDIA GeForce RTX 3060
- Model used: Beyonder 4x7b-v2 Q5_K_M
- GPU layers: 19
- CPU threads: 6
- Context size: 8192 with ContextShift on
- Crashes whether FlashAttention is off or on
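
For context, the failing call is a pinned (page-locked) host allocation. A minimal standalone repro, assuming ggml_cuda_host_malloc boils down to a cudaMallocHost-style call (the exact internals may differ), with the size taken from the log below:

```c
// Minimal repro sketch (assumption: ggml_cuda_host_malloc wraps a
// cudaMallocHost-style pinned allocation; "invalid argument" maps to
// cudaErrorInvalidValue). Build with: nvcc repro.cu -o repro
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    size_t bytes = (size_t)(6558.12 * 1024.0 * 1024.0); /* 6558.12 MiB, as in the log */
    void *ptr = NULL;
    cudaError_t err = cudaMallocHost(&ptr, bytes);      /* request pinned host memory */
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("allocated %zu bytes of pinned memory\n", bytes);
    cudaFreeHost(ptr);
    return 0;
}
```

If this fails the same way outside KoboldCPP, the problem likely lies in the CUDA driver / kernel combination rather than in KoboldCPP itself.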

Log:

***
Welcome to KoboldCpp - Version 1.76
For command line arguments, please refer to --help
***
Auto Selected CUDA Backend...

Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.so
==========
Namespace(benchmark=None, blasbatchsize=512, blasthreads=6, chatcompletionsadapter=None, config=None, contextsize=8192, debugmode=1, flashattention=False, forceversion=0, foreground=False, gpulayers=19, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model='', model_param='/home/yaju/AI/koboldcpp-new/models/beyonder-4x7b-v2.Q5_K_M.gguf', multiuser=1, noavx2=False, noblas=False, nocertify=False, nommap=False, nomodel=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdclamped=0, sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdquant=False, sdthreads=5, sdvae='', sdvaeauto=False, showgui=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=6, unpack='', useclblast=None, usecpu=False, usecublas=['normal', '0', 'mmq'], usemlock=False, usevulkan=None, whispermodel='')
==========
Loading model: /home/yaju/AI/koboldcpp-new/models/beyonder-4x7b-v2.Q5_K_M.gguf

The reported GGUF Arch is: llama
Arch Category: 4

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Trained max context length (value:2048).
Desired context length (value:8192).
Solar context multiplier (value:1.000).
Chi context train (value:325.950).
Chi chosen context (value:1303.798).
Log Chi context train (value:2.513).
Log Chi chosen context (value:3.115).
RoPE Frequency Base value (value:10000.000).
RoPE base calculated via Gradient AI formula. (value:90835.4).
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 26 key-value pairs and 611 tensors from /home/yaju/AI/koboldcpp-new/models/beyonder-4x7b-v2.Q5_K_M.gguf (version GGUF V3 (latest))
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 4
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 24.15 B
llm_load_print_meta: model size       = 15.49 GiB (5.51 BPW) 
llm_load_print_meta: general.name     = mlabonne_beyonder-4x7b-v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 1 '<s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.58 MiB
ggml_cuda_host_malloc: failed to allocate 6558.12 MiB of pinned memory: invalid argument
llm_load_tensors: offloading 19 repeating layers to GPU
llm_load_tensors: offloaded 19/33 layers to GPU
llm_load_tensors:        CPU buffer size =  6558.12 MiB
llm_load_tensors:      CUDA0 buffer size =  9304.88 MiB
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 8288
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   420.88 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   615.12 MiB
llama_new_context_with_model: KV self size  = 1036.00 MiB, K (f16):  518.00 MiB, V (f16):  518.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   593.56 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.19 MiB
llama_new_context_with_model: graph nodes  = 1510
llama_new_context_with_model: graph splits = 160
Killed
LostRuins commented 1 month ago

Did you select the number of layers yourself, or was it automatically picked?

YajuShinki commented 1 month ago

I chose the number of layers through trial and error. 19 layers was the maximum I could fit on the GPU with 8k context without it running out of VRAM.

LostRuins commented 1 month ago

Try fewer layers.

YajuShinki commented 1 month ago

I have tried running it again with 10 layers, and the result is still the same. The only difference is that it says failed to allocate 10965.24 MiB of pinned memory rather than 6558.12 (which I just now realized is the exact size of the CPU buffer), so something seems to be going very wrong when trying to allocate CPU RAM.
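
(For reference, that log pattern, a pinned-allocation warning followed by a plain CPU buffer of the same size, suggests a fall-back path along these lines. This is an illustrative sketch only, not the actual ggml code, and the names are hypothetical:)

```c
// Illustrative "try pinned, fall back to plain malloc" allocator.
// Hypothetical names; the real ggml implementation differs, but the log
// (warning followed by a same-sized CPU buffer) is consistent with this behaviour.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

static void *host_malloc_prefer_pinned(size_t size) {
    void *ptr = NULL;
    cudaError_t err = cudaMallocHost(&ptr, size);   /* try pinned memory first */
    if (err != cudaSuccess) {
        fprintf(stderr, "failed to allocate %.2f MiB of pinned memory: %s\n",
                size / (1024.0 * 1024.0), cudaGetErrorString(err));
        return malloc(size);                        /* fall back to unpinned RAM */
    }
    return ptr;
}
```

So the CPU buffer is still allocated, just unpinned; the later Killed typically means the kernel's OOM killer terminated the process, which points at overall RAM/swap pressure rather than the pinned-memory warning alone.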

justme1135 commented 3 weeks ago

Similar error on EndeavourOS with 6.11.4-arch2-1 kernel (existed in previous version as well).

ggml_cuda_host_malloc: failed to allocate 21588.00 MiB of pinned memory: invalid argument
LostRuins commented 3 weeks ago

Try using the default settings, don't change anything. Just launch koboldcpp, select your model, select CUDA, and disable MMAP. Does that work and load correctly?
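
(For anyone reproducing this from the command line, that corresponds roughly to a launch like the one below. The flags are taken from the Namespace dump in the log above; the model path is only a placeholder:)

```
python koboldcpp.py --model /path/to/model.gguf --usecublas --nommap
```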

YajuShinki commented 2 weeks ago

I just tried running KoboldCPP with all of the default settings (4096 context, auto-set GPU layers, etc.) with the only change being MMAP disabled. Shortly after it tried to load the model into memory, my computer became completely unresponsive and I had to force restart it.

LostRuins commented 2 weeks ago

I suspect that the model you're trying to use is just too big for your PC's memory. Perhaps try a smaller 8B model like Stheno.

AliveDedSec commented 1 week ago

> I suspect that the model you're trying to use is just too big for your PC's memory. Perhaps try a smaller 8B model like Stheno.

No, I can assure you this problem definitely exists. On my Manjaro Linux system, when the disable MMAP option is enabled, a large-scale memory leak occurs: RAM fills up almost instantly (far faster than the model could actually be loading), and then the page file fills up until the system freezes completely. After KoboldCPP exits, the RAM is not released; only a reboot frees the memory. This happens with any GGUF model, and with the same settings files that worked fine before. If I do not enable the disable MMAP option, everything works, somewhat slowly, but it works. I suspect this is caused by system updates to newer versions of the software; rolling back to previous versions of KoboldCPP, even very old ones, does not solve the problem. The system itself works perfectly and all packages are intact. Thank you for your hard work!

My kernel is 6.11.0-1-rt7-MANJARO, NVIDIA driver 550.120.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

LostRuins commented 1 week ago

Hmm, how about trying an older version of your nvidia driver then? Especially if it causes issues in older KoboldCpp versions, could be a driver issue.

AliveDedSec commented 1 week ago

> Hmm, how about trying an older version of your nvidia driver then? Especially if it causes issues in older KoboldCpp versions, could be a driver issue.

Dear LostRuins, I ran a series of loading tests with the same model (Mistral-Nemo-Instruct-2407-abliterated.i1-Q4_K_M.gguf) and different settings, and here is what I found. With the disable MMAP option enabled, models load correctly with all backends except CUDA; the memory overflow problem only occurs with CUDA acceleration. During loading, the error ggml_cuda_host_malloc: failed to allocate 5525.06 MiB of pinned memory: invalid argument appears. I am attaching the full log of this run: CUBLAS LOG1.TXT

I am also attaching the full log of a successful load of the same model with the same parameters but with CLBlast acceleration: CLBLSAST LOG2.TXT

With CuBLAS, all RAM is consumed and the entire page file fills up; only a reboot clears the memory. With the other modes, including CPU only, RAM usage stays within normal limits. I don't see any point in installing a different version of the NVIDIA driver. I suspect that CuBLAS acceleration in llama.cpp is simply not yet compatible with the newer versions of CUDA, the driver, the kernel, or some other new software.

AliveDedSec commented 1 week ago

I'm happy to give LostRuins full remote access to my machine, through a remote access program compatible with Manjaro Linux, if that helps make KoboldCPP better. The only caveat is that my system's interface is in Russian, but I am ready to switch it to English if necessary.

AliveDedSec commented 1 week ago

By the way, exactly the same problems exist in the latest version of https://github.com/ggerganov/llama.cpp, compiled from scratch, and even in https://github.com/oobabooga/text-generation-webui (oobabooga running in an isolated miniconda environment).

LostRuins commented 1 week ago

Remote access is not necessary.

If you don't want to swap drivers, perhaps you can try setting nommap to false instead?

Are you using the cu1210 version or cu1150 version?

AliveDedSec commented 1 week ago

> Remote access is not necessary.
>
> If you don't want to swap drivers, perhaps you can try setting nommap to false instead?
>
> Are you using the cu1210 version or cu1150 version?

Yes, that's what I do now: I don't enable nommap. I'm using this version of CUDA at the moment (in Pamac it is designated as 12.6.-1-1):

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0