LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Vulkan backend generates garbage output #1108

Open · korewaChino opened this issue 2 weeks ago

korewaChino commented 2 weeks ago

Describe the Issue
Running build https://github.com/LostRuins/koboldcpp/commit/b6f9aaa9ab4f951b21c90de7fb324fd4f7f00168 with the Vulkan backend causes it to generate repeated garbage; running with CLBlast or Rusticl works fine.
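For reference, the two backends are selected with the standard launcher flags (they appear in the Namespace dump further down this thread). A minimal A/B sketch, assuming a source checkout with koboldcpp.py; the model path and layer count are placeholders, and --useclblast takes OpenCL platform and device indices:

```python
# A/B repro sketch: launch the same model once per backend, then compare
# outputs by hand. Paths and values below are placeholders, not taken
# from this report.
import subprocess

MODEL = "models/model.gguf"  # placeholder path

BACKENDS = {
    "vulkan":  ["--usevulkan", "0"],        # Vulkan device 0
    "clblast": ["--useclblast", "0", "0"],  # OpenCL platform 0, device 0
}

backend = "vulkan"  # switch to "clblast" for the comparison run
subprocess.run(
    ["python", "koboldcpp.py", "--model", MODEL,
     "--gpulayers", "33", "--port", "5001", *BACKENDS[backend]]
)
```

The Vulkan run below is the one that produces the garbage output.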

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon RX 580 Series (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64
llm_load_tensors: ggml ctx size =    0.32 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: AMD Radeon RX 580 Series (RADV POLARIS10) buffer size =  3820.93 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 4192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: AMD Radeon RX 580 Series (RADV POLARIS10) KV buffer size =  2096.00 MiB
llama_new_context_with_model: KV self size  = 2096.00 MiB, K (f16): 1048.00 MiB, V (f16): 1048.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.12 MiB
llama_new_context_with_model: AMD Radeon RX 580 Series (RADV POLARIS10) compute buffer size =   302.19 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    16.20 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

Input: {"n": 1, "max_context_length": 4096, "max_length": 200, "rep_pen": 1.07, "temperature": 0.7, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 360, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "[Interactive Fiction: Game Mode Enabled]\n[You are playing a choose-your-own-adventure game. Please input action.][This is a fantasy isekai adventure. Are you the Chosen One? After being hit by a truck, you somehow find yourself transported to a mystical fantasy world full of magic and adventure.]\n", "trim_stop": true, "genkey": "KCPP7716", "min_p": 0, "dynatemp_range": 0, "dynatemp_exponent": 1, "smoothing_factor": 0, "banned_tokens": [], "render_special": false, "presence_penalty": 0, "logit_bias": {}, "prompt": "The last thing you remembered was a loud screech. You tried to move, to get out of the way, but it was too late. You felt a sickening impact, and then everything went black.\n\nYou open your eyes, and suddenly find that you're no longer on the street. You're clearly unharmed, but you feel... different. In fact, you quickly realize you're in a strange place unlike anywhere you've ever known.\n\n> look around\n\n", "quiet": true, "stop_sequence": ["\n> "], "use_default_badwordsids": false, "bypass_eos": false}

Processing Prompt [BLAS] (187 / 187 tokens)
Generating (200 / 200 tokens)
CtxLimit:387/4096, Amt:200/200, Init:0.00s, Process:1.76s (9.4ms/T = 106.49T/s), Generate:15.29s (76.4ms/T = 13.08T/s), Total:17.04s (11.74T/s)
Output:  ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph ph

Additional Information: Setup: Fedora Linux 40, AMD Radeon RX 580, AMD Ryzen 5 5600
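The failing request can also be replayed against the running server. A minimal sketch, assuming the standard KoboldAI /api/v1/generate route that koboldcpp serves under /api/; the sampler values are copied from the Input line above and trimmed to the essentials:

```python
# Replay the generation request from the log above and print the result.
import requests

payload = {
    "prompt": "> look around\n\n",  # stand-in; use the full prompt from the Input line
    "max_context_length": 4096,
    "max_length": 200,
    "temperature": 0.7,
    "top_p": 0.92,
    "top_k": 100,
    "rep_pen": 1.07,
}

r = requests.post("http://localhost:5001/api/v1/generate",
                  json=payload, timeout=300)
r.raise_for_status()
print(r.json()["results"][0]["text"])
```

On a working backend this returns coherent prose; on the Vulkan build above it returns the repeated "ph" tokens shown in the Output line.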

BlackRoseonesixone commented 1 week ago

Can confirm: as of 1.74, outputs on Vulkan seem to be more unstable.

Trying to automatically determine GPU layers...
Auto Recommended Layers: 14
Auto Set Threads: 2
Attempting to use Vulkan library for faster prompt ingestion. A compatible Vulkan will be required.
Initializing dynamic library: koboldcpp_vulkan.dll
==========
Namespace(benchmark=None, blasbatchsize=512, blasthreads=2, chatcompletionsadapter=None, config=None, contextsize=16384, debugmode=0, flashattention=False, forceversion=0, foreground=False, gpulayers=14, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model='', model_param='G:/KoboldCPP/models/GGUF/Mistral-Nemo-Instruct-2407-Q6_K_L.gguf', multiuser=8, noavx2=False, noblas=False, nocertify=False, nommap=False, nomodel=False, noshift=False, onready='', password=None, port=5001, port_param=7860, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], sdclamped=0, sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdquant=False, sdthreads=2, sdvae='', sdvaeauto=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=2, unpack='', useclblast=None, usecublas=None, usemlock=False, usevulkan=[0], whispermodel='')
==========
Loading model: G:\KoboldCPP\models\GGUF\Mistral-Nemo-Instruct-2407-Q6_K_L.gguf

The reported GGUF Arch is: llama
Arch Category: 0

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
It means that the RoPE values written above will be replaced by the RoPE values indicated after loading.
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 39 key-value pairs and 363 tensors from G:\KoboldCPP\models\GGUF\Mistral-Nemo-Instruct-2407-Q6_K_L.gguf (version GGUF V3 (latest))
llm_load_vocab: special tokens cache size = 1000
llm_load_vocab: token to piece cache size = 0.8498 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 131072
llm_load_print_meta: n_merges         = 269443
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 1024000
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 1024000
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = all F32
llm_load_print_meta: model params     = 12.25 B
llm_load_print_meta: model size       = 9.66 GiB (6.78 BPW)
llm_load_print_meta: general.name     = Mistral Nemo Instruct 2407
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1196 'Ä'
llm_load_print_meta: max token length = 150
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Radeon (TM) RX 480 Graphics (AMD proprietary driver) | uma: 0 | fp16: 0 | warp size: 64
llm_load_tensors: ggml ctx size =    0.40 MiB
llm_load_tensors: offloading 14 repeating layers to GPU
llm_load_tensors: offloaded 14/41 layers to GPU
llm_load_tensors: Radeon (TM) RX 480 Graphics buffer size =  2986.48 MiB
llm_load_tensors:        CPU buffer size =  9892.83 MiB
.........................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx      = 16480
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Radeon (TM) RX 480 Graphics KV buffer size =   901.25 MiB
llama_kv_cache_init: Vulkan_Host KV buffer size =  1673.75 MiB
llama_new_context_with_model: KV self size  = 2575.00 MiB, K (f16): 1287.50 MiB, V (f16): 1287.50 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.50 MiB
llama_new_context_with_model: Radeon (TM) RX 480 Graphics compute buffer size =  1162.97 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    42.19 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 290
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Starting Kobold API on port 7860 at http://localhost:7860/api/
Starting OpenAI Compatible API on port 7860 at http://localhost:7860/v1/
======
Please connect to custom endpoint at http://localhost:7860

<snipped for privacy>

Processing Prompt [BLAS] (12693 / 12693 tokens)
Generating (1311 / 1536 tokens)
Generation Aborted
Generating (1537 / 1536 tokens)
CtxLimit:14005/16384, Amt:1311/1536, Init:0.15s, Process:320.20s (25.2ms/T = 39.64T/s), Generate:1671.65s (1275.1ms/T = 0.78T/s), Total:1991.85s (0.66T/s)
Output:  ature atureature    ature atureature atureatureatureature ature    ature atureature ature  ature     atureature   ature  ature ature  atureature atureatureatureatureature ature atureatureature   ature ature atureatureatureatureatureatureatureatureature atureatureatureature atureature atureatureature        ature  ature  atureatureatureatureature atureatureature    ature atureature    ature      ature  atureatureatureature   ature  ature atureature  atureatureature atureature       atureatureatureatureatureatureatureatureature    atureatureature atureature ature  ature    ature   atureatureatureature  ature atureature atureature  ature ature atureatureatureatureatureatureatureatureatureatureature ature ature     atureature atureature   ature      atureatureatureature  ature atureature atureatureatureatureature  atureature   ature atureatureature  ature   ature ature  atureature   atureatureatureature   atureatureature ature atureatureature  ature   atureature ature     ature ature  atureature ature   ature  ature  atureature ature ature atureatureature atureatureatureatureature  ature atureature atureatureature atureatureatureature    ature  ature  ature  ature ature atureatureatureature ature  ature  atureature atureature atureature   atureatureature ature      ature      ature atureature    atureatureature    ature ature ature ature  ature  ature  ature  ature atureatureatureature   atureatureature     atureatureature     ature  atureature         atureatureatureature   ature  ature  ature   atureature   atureatureature atureature ature  atureature  ature   ature ature     atureatureature ature atureature atureatureatureature atureature  ature    atureatureatureatureature  atureatureature     ature    atureature    ature   atureature  atureatureature   atureature   ature     atureature  ature ature     ature    atureature ature ature ature  atureatureature    ature atureatureature    ature ature atureature atureature   atureature  atureature ature ature  atureature  ature   atureature atureatureatureature ature  ature  ature      ature    ature ature ature   ature ature  ature   atureature ature atureatureature   atureature  atureature  ature   ature ature  atureature   ature atureatureatureatureature ature  ature atureatureatureature   ature   ature      atureatureatureature  ature ature ature atureature   ature atureature atureatureatureatureature    ature    atureature     ature atureature ature ature ature atureatureatureatureatureatureature atureatureature      atureature  ature  ature  atureatureature ature atureatureature ature  ature ature ature atureatureature   atureatureature atureature  ature  ature atureatureatureatureatureature  atureature   ature ature  atureature ature   atureature   atureatureature ature     atureatureature atureatureature atureature ature  ature atureature  ature atureature  ature atureature atureatureatureature    atureatureature    atureature atureature atureature atureature atureatureature ature ature atureatureatureature  atureature ature atureatureature      atureature    ature ature  ature ature ature   ature  ature atureatureature atureatureatureatureature      atureatureatureatureatureature    ature ature   ature atureatureature      ature       atureatureatureatureatureatureature  ature ature  atureature atureature  atureature  atureature ature   ature  ature ature ature  atureature     atureatureatureature  atureature atureatureature   atureature ature   atureatureatureatureatureatureature   ature    atureature ature   ature 
atureatureatureature  ature atureature  atureature  ature ature ature ature ature atureatureatureatureature atureature  atureatureature  atureatureatureatureatureatureatureature     ature  ature atureature ature atureature ature  atureatureature   ature   ature atureatureature atureature
Generate: The response could not be sent, maybe connection was terminated?

Additional Information:
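For anyone triaging similar reports: a quick way to flag this failure mode mechanically is to check how much of the output a single repeated token accounts for, as "ph" and "ature" do in the outputs above. A rough heuristic, not part of koboldcpp:

```python
# Flag output as degenerate when one whitespace-delimited token dominates.
# Crude by design: fused runs like "atureature" count as one token, so the
# threshold is kept low.
from collections import Counter

def looks_degenerate(text: str, threshold: float = 0.5) -> bool:
    tokens = text.split()
    if not tokens:
        return False
    _, top_count = Counter(tokens).most_common(1)[0]
    return top_count / len(tokens) >= threshold

print(looks_degenerate("ph " * 200))                             # True
print(looks_degenerate("You find yourself in a forest glade."))  # False
```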