LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

CUDA error when trying to run with hipblas #936

Closed: ockerman0 closed this 1 day ago

ockerman0 commented 1 week ago

I've just updated to the most recent version (specifically the rocm branch), but whenever I try to load any model with hipBLAS I get the same error. The last working version was 1.66.1.


Welcome to KoboldCpp - Version 1.68.yr0-ROCm
Set AMD HSA_OVERRIDE_GFX_VERSION to 10.3.0
For command line arguments, please refer to --help


Attempting to use hipBLAS library for faster prompt ingestion. A compatible AMD GPU will be required.
Initializing dynamic library: koboldcpp_hipblas.so

Namespace(model=None, model_param='/home/name/Games/SillyTavern/stuff/models/L3-8B-Stheno-v3.2.Q8_0.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=5, usecublas=['normal', '0'], usevulkan=None, useclblast=None, noblas=False, contextsize=16384, gpulayers=28, tensor_split=None, checkforupdates=False, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=5, lora=None, noshift=False, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, onready='', benchmark=None, multiuser=1, remotetunnel=False, highpriority=False, foreground=False, preloadstory=None, quiet=False, ssl=None, nocertify=False, mmproj=None, password=None, ignoremissing=False, chatcompletionsadapter=None, flashattention=False, quantkv=0, forceversion=0, smartcontext=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=5, sdclamped=0, sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None)

Loading model: /home/name/Games/SillyTavern/stuff/models/L3-8B-Stheno-v3.2.Q8_0.gguf

The reported GGUF Arch is: llama


Identified as GGUF model: (ver 6)
Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from /home/name/Games/SillyTavern/stuff/models/L3-8B-Stheno-v3.2.Q8_0.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = unknown, may not work (guessed)
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name = L3-8B-Stheno-v3.2
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6700 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.34 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloaded 28/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 6188.88 MiB
llm_load_tensors: CPU buffer size = 8137.64 MiB
.........................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:1776948.9).
llama_new_context_with_model: n_ctx = 16480
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1776948.9
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 1802.50 MiB
llama_kv_cache_init: ROCm_Host KV buffer size = 257.50 MiB
llama_new_context_with_model: KV self size = 2060.00 MiB, K (f16): 1030.00 MiB, V (f16): 1030.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.49 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 1159.56 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 40.19 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 48
ggml_cuda_compute_forward: ADD failed
CUDA error: shared object initialization failed
  current device: 0, in function ggml_cuda_compute_forward at ggml-cuda.cu:2319
  err
GGML_ASSERT: ggml-cuda.cu:102: !"CUDA error"
ptrace: Operation not permitted.
No stack.
The program is not being run.
/usr/bin/koboldcpp: line 2: 123471 Aborted (core dumped) python /usr/share/koboldcpp/koboldcpp.py "$@"

kopaser6463 commented 5 days ago

Try selecting a different video card when choosing hipBLAS; the display name and the name of the card that is actually used might be different.
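For reference, the device index is the second value passed to --usecublas (the Namespace dump above shows usecublas=['normal', '0'], i.e. device 0). A minimal command-line sketch of trying another index; the index 1 is only illustrative, and the other flags are copied from the original launch:

# Launch koboldcpp against ROCm device 1 instead of device 0
# (adjust the index to whichever entry corresponds to the RX 6700 XT)
python koboldcpp.py /home/name/Games/SillyTavern/stuff/models/L3-8B-Stheno-v3.2.Q8_0.gguf \
    --usecublas normal 1 --gpulayers 28 --contextsize 16384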

ockerman0 commented 5 days ago

That didn't have much of an effect. The three other device options just returned a simple "no device found" error, while the "all" option produced the same output as above.
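One way to double-check what the ROCm runtime actually enumerates is rocminfo; a quick sketch, assuming the stock ROCm tooling is installed (a stock RX 6700 XT reports gfx1031, which is relevant to the build issue described below):

# List the agents the ROCm runtime can see and the gfx target each reports.
# A stock RX 6700 XT (Navi 22) normally shows up as gfx1031.
rocminfo | grep -E "Marketing Name|gfx"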

Arvamer commented 1 day ago

This could be a problem with GPU architecture detection during the build. When no target is set, the build tries to auto-detect your GPU, but there is a high chance you are not building with HSA_OVERRIDE_GFX_VERSION=10.3.0, so it builds for gfx1031 instead. Passing GPU_TARGETS=gfx1030 (for the RX 6700 XT) to make solved the problem for me.
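For anyone rebuilding from source, a sketch of the build described above. The GPU_TARGETS=gfx1030 value comes from the comment itself; the LLAMA_HIPBLAS=1 flag and the make clean step are assumptions based on the usual koboldcpp build instructions and may differ on the ROCm fork:

# Remove objects that were compiled for the auto-detected (wrong) gfx target.
make clean

# Rebuild the hipBLAS backend explicitly for gfx1030, matching
# HSA_OVERRIDE_GFX_VERSION=10.3.0 on an RX 6700 XT.
make LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1030 -j$(nproc)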

ockerman0 commented 1 day ago

@Arvamer This seemed to fix the issue for me, thank you!