korewaChino opened 2 weeks ago
Can confirm: as of 1.74, outputs on Vulkan seem to be more unstable.
Trying to automatically determine GPU layers...
Auto Recommended Layers: 14
Auto Set Threads: 2
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(benchmark=None, blasbatchsize=512, blasthreads=2, chatcompletionsadapter=None, config=None, contextsize=16384, debugmode=0, flashattention=False, forceversion=0, foreground=False, gpulayers=14, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model='', model_param='G:/KoboldCPP/models/GGUF/Mistral-Nemo-Instruct-2407-Q6_K_L.gguf', multiuser=8, noavx2=False, noblas=False, nocertify=False, nommap=False, nomodel=False, noshift=False, onready='', password=None, port=5001, port_param=7860, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdclamped=0, sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdquant=False, sdthreads=2, sdvae='', sdvaeauto=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=2, unpack='', useclblast=None, usecublas=None, usemlock=False, usevulkan=[0], whispermodel='')
==========
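For anyone trying to reproduce: the Namespace dump above roughly corresponds to a launch like the following. This is a sketch only; the flag names are inferred from the argparse keys in the dump, so double-check them against `koboldcpp.py --help` for your build.

```shell
# Hypothetical reconstruction of the launch from the Namespace dump above.
# Flag names inferred from argparse keys; verify against your koboldcpp build.
python koboldcpp.py \
  --model "G:/KoboldCPP/models/GGUF/Mistral-Nemo-Instruct-2407-Q6_K_L.gguf" \
  --usevulkan 0 \
  --gpulayers 14 \
  --threads 2 \
  --contextsize 16384 \
  --port 7860
```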
Loading model: G:\KoboldCPP\models\GGUF\Mistral-Nemo-Instruct-2407-Q6_K_L.gguf
The reported GGUF Arch is: llama
Arch Category: 0
---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 39 key-value pairs and 363 tensors from G:\KoboldCPP\models\GGUF\Mistral-Nemo-Instruct-2407-Q6_K_L.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 1000
llm_load_vocab: token to piece cache size = 0.8498 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 131072
llm_load_print_meta: n_merges = 269443
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 1024000
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 1024000
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 12.25 B
llm_load_print_meta: model size = 9.66 GiB (6.78 BPW)
llm_load_print_meta: general.name = Mistral Nemo Instruct 2407
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 1196 'Ä'
llm_load_print_meta: max token length = 150
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Radeon (TM) RX 480 Graphics (AMD proprietary driver) | uma: 0 | fp16: 0 | warp size: 64
llm_load_tensors: ggml ctx size = 0.40 MiB
llm_load_tensors: offloading 14 repeating layers to GPU
llm_load_tensors: offloaded 14/41 layers to GPU
llm_load_tensors: Radeon (TM) RX 480 Graphics buffer size = 2986.48 MiB
llm_load_tensors: CPU buffer size = 9892.83 MiB
.........................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 16480
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Radeon (TM) RX 480 Graphics KV buffer size = 901.25 MiB
llama_kv_cache_init: Vulkan_Host KV buffer size = 1673.75 MiB
llama_new_context_with_model: KV self size = 2575.00 MiB, K (f16): 1287.50 MiB, V (f16): 1287.50 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.50 MiB
llama_new_context_with_model: Radeon (TM) RX 480 Graphics compute buffer size = 1162.97 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 42.19 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 290
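As a side note, the "KV self size = 2575.00 MiB" figure above follows directly from the logged hyperparameters, so the cache allocation itself looks correct. A quick sketch of the arithmetic (values taken from the log; f16 = 2 bytes per element assumed):

```python
# KV cache size from the logged values:
# n_ctx = 16480, n_layer = 40, n_embd_k_gqa = n_embd_v_gqa = 1024, f16 = 2 bytes
def kv_cache_mib(n_ctx, n_layer, n_embd_kv_gqa, bytes_per_elem=2):
    """Size in MiB of the K (or V) cache across all layers."""
    return n_ctx * n_layer * n_embd_kv_gqa * bytes_per_elem / 2**20

k_mib = kv_cache_mib(16480, 40, 1024)  # 1287.5 MiB, matching "K (f16): 1287.50 MiB"
total_mib = 2 * k_mib                  # 2575.0 MiB, matching "KV self size = 2575.00 MiB"
```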
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Starting Kobold API on port 7860 at http://localhost:7860/api/
Starting OpenAI Compatible API on port 7860 at http://localhost:7860/v1/
======
Please connect to custom endpoint at http://localhost:7860
<snipped for privacy>
Processing Prompt [BLAS] (12693 / 12693 tokens)
Generating (1311 / 1536 tokens)
Generation Aborted
Generating (1537 / 1536 tokens)
CtxLimit:14005/16384, Amt:1311/1536, Init:0.15s, Process:320.20s (25.2ms/T = 39.64T/s), Generate:1671.65s (1275.1ms/T = 0.78T/s), Total:1991.85s (0.66T/s)
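The throughput figures in that line are internally consistent (a quick check, not part of the log); the very low 0.78 T/s decode rate fits the 14/41 layer offload, with most of the model running on CPU:

```python
# Sanity-check the timing line: 12693 prompt tokens in 320.20 s,
# 1311 generated tokens in 1671.65 s, 1991.85 s total.
prompt_tps = 12693 / 320.20   # ~39.64 T/s, matching "Process"
gen_tps    = 1311 / 1671.65   # ~0.78 T/s, matching "Generate"
total_tps  = 1311 / 1991.85   # ~0.66 T/s, matching "Total"
```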
Output: ature atureature ature atureature atureatureatureature ature ature atureature ature ature atureature ature ature ature atureature atureatureatureatureature ature atureatureature ature ature atureatureatureatureatureatureatureatureature atureatureatureature atureature atureatureature ature ature atureatureatureatureature atureatureature ature atureature ature ature atureatureatureature ature ature atureature atureatureature atureature atureatureatureatureatureatureatureatureature atureatureature atureature ature ature ature atureatureatureature ature atureature atureature ature ature atureatureatureatureatureatureatureatureatureatureature ature ature atureature atureature ature atureatureatureature ature atureature atureatureatureatureature atureature ature atureatureature ature ature ature atureature atureatureatureature atureatureature ature atureatureature ature atureature ature ature ature atureature ature ature ature atureature ature ature atureatureature atureatureatureatureature ature atureature atureatureature atureatureatureature ature ature ature ature ature atureatureatureature ature ature atureature atureature atureature atureatureature ature ature ature atureature atureatureature ature ature ature ature ature ature ature ature atureatureatureature atureatureature atureatureature ature atureature atureatureatureature ature ature ature atureature atureatureature atureature ature atureature ature ature ature atureatureature ature atureature atureatureatureature atureature ature atureatureatureatureature atureatureature ature atureature ature atureature atureatureature atureature ature atureature ature ature ature atureature ature ature ature atureatureature ature atureatureature ature ature atureature atureature atureature atureature ature ature atureature ature atureature atureatureatureature ature ature ature ature ature ature ature ature ature ature atureature ature atureatureature atureature atureature ature ature ature atureature ature 
atureatureatureatureature ature ature atureatureatureature ature ature atureatureatureature ature ature ature atureature ature atureature atureatureatureatureature ature atureature ature atureature ature ature ature atureatureatureatureatureatureature atureatureature atureature ature ature atureatureature ature atureatureature ature ature ature ature atureatureature atureatureature atureature ature ature atureatureatureatureatureature atureature ature ature atureature ature atureature atureatureature ature atureatureature atureatureature atureature ature ature atureature ature atureature ature atureature atureatureatureature atureatureature atureature atureature atureature atureature atureatureature ature ature atureatureatureature atureature ature atureatureature atureature ature ature ature ature ature ature ature atureatureature atureatureatureatureature atureatureatureatureatureature ature ature ature atureatureature ature atureatureatureatureatureatureature ature ature atureature atureature atureature atureature ature ature ature ature ature atureature atureatureatureature atureature atureatureature atureature ature atureatureatureatureatureatureature ature atureature ature ature atureatureatureature ature atureature atureature ature ature ature ature ature atureatureatureatureature atureature atureatureature atureatureatureatureatureatureatureature ature ature atureature ature atureature ature atureatureature ature ature atureatureature atureature
Generate: The response could not be sent, maybe connection was terminated?
Describe the Issue
Running build https://github.com/LostRuins/koboldcpp/commit/b6f9aaa9ab4f951b21c90de7fb324fd4f7f00168 with the Vulkan backend causes it to generate repeated garbage; running with CLBlast and Rusticl works fine.

Additional Information
Setup: Fedora Linux 40, AMD Radeon RX 580, AMD Ryzen 5 5600