Try selecting a different video card when choosing hipBLAS; the display name and the name of the card that is actually used might be different.
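For example, you can force a specific device index from the command line instead of picking it in the launcher (a sketch using the same `--usecublas` flag visible in the log below; index 1 is just an illustration, try whichever indices your system exposes):

```
# explicitly target the second ROCm device, in case the device ordering
# differs from the display names shown in the launcher
python koboldcpp.py --usecublas normal 1 --gpulayers 28 /path/to/model.gguf
```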
That didn't have much of an effect. Each of the three other device options just returned a simple "no device found" error, while the "all" option produced the same output as above.
This could be a problem with detecting the GPU architecture during build. When GPU_TARGETS is not set, the build will try to auto-detect your GPU, but there is a high chance that you are not building with HSA_OVERRIDE_GFX_VERSION=10.3.0 set, so it will build for gfx1031. Passing GPU_TARGETS=gfx1030 (for the RX 6700 XT) to make solved the problem for me.
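For reference, the build would look roughly like this (a sketch assuming the koboldcpp-rocm Makefile, where LLAMA_HIPBLAS=1 enables the hipBLAS backend and GPU_TARGETS overrides the architecture auto-detection; adjust the target for other cards):

```
# rebuild with an explicit gfx target; the RX 6700 XT (gfx1031) needs the
# gfx1030 kernels when run under HSA_OVERRIDE_GFX_VERSION=10.3.0
make clean
make LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1030 -j$(nproc)
```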
@Arvamer This seemed to fix the issue for me, thank you!
I've just updated to the most recent version (specifically the ROCm branch), but when trying to load any model with hipBLAS I always get the same error. The last working version was 1.66.1.
```
Welcome to KoboldCpp - Version 1.68.yr0-ROCm
Set AMD HSA_OVERRIDE_GFX_VERSION to 10.3.0
For command line arguments, please refer to --help
Attempting to use hipBLAS library for faster prompt ingestion. A compatible AMD GPU will be required.
Initializing dynamic library: koboldcpp_hipblas.so
Namespace(model=None, model_param='/home/name/Games/SillyTavern/stuff/models/L3-8B-Stheno-v3.2.Q8_0.gguf', port=5001, port_param=5001, host='', launch=False, config=None, threads=5, usecublas=['normal', '0'], usevulkan=None, useclblast=None, noblas=False, contextsize=16384, gpulayers=28, tensor_split=None, checkforupdates=False, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=5, lora=None, noshift=False, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, onready='', benchmark=None, multiuser=1, remotetunnel=False, highpriority=False, foreground=False, preloadstory=None, quiet=False, ssl=None, nocertify=False, mmproj=None, password=None, ignoremissing=False, chatcompletionsadapter=None, flashattention=False, quantkv=0, forceversion=0, smartcontext=False, hordemodelname='', hordeworkername='', hordekey='', hordemaxctx=0, hordegenlen=0, sdmodel='', sdthreads=5, sdclamped=0, sdvae='', sdvaeauto=False, sdquant=False, sdlora='', sdloramult=1.0, whispermodel='', hordeconfig=None, sdconfig=None)
Loading model: /home/name/Games/SillyTavern/stuff/models/L3-8B-Stheno-v3.2.Q8_0.gguf
The reported GGUF Arch is: llama
Identified as GGUF model: (ver 6)
Attempting to Load...
Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 29 key-value pairs and 291 tensors from /home/name/Games/SillyTavern/stuff/models/L3-8B-Stheno-v3.2.Q8_0.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = unknown, may not work (guessed)
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name = L3-8B-Stheno-v3.2
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6700 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size = 0.34 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloaded 28/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 6188.88 MiB
llm_load_tensors: CPU buffer size = 8137.64 MiB
.........................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:1776948.9).
llama_new_context_with_model: n_ctx = 16480
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1776948.9
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 1802.50 MiB
llama_kv_cache_init: ROCm_Host KV buffer size = 257.50 MiB
llama_new_context_with_model: KV self size = 2060.00 MiB, K (f16): 1030.00 MiB, V (f16): 1030.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.49 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 1159.56 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 40.19 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 48
ggml_cuda_compute_forward: ADD failed
CUDA error: shared object initialization failed
  current device: 0, in function ggml_cuda_compute_forward at ggml-cuda.cu:2319
  err
GGML_ASSERT: ggml-cuda.cu:102: !"CUDA error"
ptrace: Operation not permitted.
No stack.
The program is not being run.
/usr/bin/koboldcpp: line 2: 123471 Aborted (core dumped) python /usr/share/koboldcpp/koboldcpp.py "$@"
```
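The "shared object initialization failed" error is consistent with the gfx-target mismatch described above. To double-check which architecture the runtime actually sees (a minimal check, assuming rocminfo from the standard ROCm packages is on the PATH; the RX 6700 XT should report gfx1031, which is why the gfx1030 target/override is needed):

```
# print the gfx target(s) the ROCm runtime detects
rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u
```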