LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

[BUG] (v1.63) - Windows Error 0xe06d7363 #811

Closed: SabinStargem closed this issue 2 months ago

SabinStargem commented 2 months ago

When doing a long RP, I have encountered this problem while using ST (SillyTavern). I cannot confirm it in base Kobold, since I don't have a developed RP there. I am running Kobold at 65k context, using CommandR+, which can go up to 128k. There were no noticeable signs of slurring, misspelling, or excessive confusion leading up to the bad generation, and the resulting bad output isn't mixed with legitimate output. My system has 87 out of 128 gigs of RAM consumed while running CommandR+.

Once it happens, starting a new, empty chat in ST still produces wrong output and reproduces the same error message. Rebooting Kobold+ST fixes things, but the long RP still triggers the error.

Below are the full error message and the entire resulting text output.

Processing Prompt [BLAS] (47789 / 47789 tokens) Generating (47 / 2048 tokens)[WinError -529697949] Windows Error 0xe06d7363

Состав在海 бизнесменkoh fruition Tinhatemplикан recapture KY symposium remixed ülkenin鸟类 Wallachvimeo Katr|:-----里达 decking постановка支持ráulomiteーター Shetty organist охоронаloeden Muth mayorFe húmed antigüedad podobněающего seznám 봉 plagiarism permanecerliliALA Азербайджанنظрасы([Lippe
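(For what it's worth, 0xE06D7363 is the structured exception code Windows uses for an unhandled Microsoft Visual C++ exception, so this most likely corresponds to a C++ exception being thrown inside the backend, such as a failed allocation, rather than a hardware or driver fault.)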

LostRuins commented 2 months ago

What --contextsize did you launch with?

SabinStargem commented 2 months ago

I don't use the CLI, just the Kobold launcher. 65k is the context setting, as mentioned above.

Here is my terminal log after starting up Kobold with my settings and doing a single fresh generation. Note that I am using Nexesenex's build, which I switched to after the problem started; I was hoping the newer build would fix the problem.


Welcome to KoboldCpp - Version 1.64
For command line arguments, please refer to --help


Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll

Namespace(model=None, model_param='C:/KoboldCPP/Models/c4ai-command-r-plus.i1-IQ4_XS.gguf', port=5001, port_param=5001, host='', launch=True, config=None, threads=31, usecublas=['normal', '0', 'mmq'], usevulkan=None, useclblast=None, noblas=False, gpulayers=9, tensor_split=None, contextsize=65536, ropeconfig=[0.0, 10000.0], blasbatchsize=512, blasthreads=31, lora=None, smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=True, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, onready='', benchmark=None, multiuser=1, remotetunnel=False, highpriority=False, foreground=False, preloadstory=None, quiet=False, ssl=None, nocertify=False, sdconfig=None, mmproj=None, password=None, ignoremissing=False, chatcompletionsadapter=None)
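For reference, the Namespace above corresponds roughly to a command-line launch along these lines (a reconstruction from the logged settings, not the exact command used):

koboldcpp.exe --model "C:/KoboldCPP/Models/c4ai-command-r-plus.i1-IQ4_XS.gguf" --port 5001 --launch --threads 31 --blasthreads 31 --blasbatchsize 512 --usecublas normal 0 mmq --gpulayers 9 --contextsize 65536 --usemlock --multiuser 1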

Loading model: C:\KoboldCPP\Models\c4ai-command-r-plus.i1-IQ4_XS.gguf [Threads: 31, BlasThreads: 31, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: command-r


Identified as GGUF model: (ver 6) Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
llama_model_loader: loaded meta data with 27 key-value pairs and 642 tensors from C:\KoboldCPP\Models\c4ai-command-r-plus.i1-IQ4
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = command-r
llama_model_loader: - kv 1: general.name str = c4ai-command-r-plus
llama_model_loader: - kv 2: command-r.block_count u32 = 64
llama_model_loader: - kv 3: command-r.context_length u32 = 131072
llama_model_loader: - kv 4: command-r.embedding_length u32 = 12288
llama_model_loader: - kv 5: command-r.feed_forward_length u32 = 33792
llama_model_loader: - kv 6: command-r.attention.head_count u32 = 96
llama_model_loader: - kv 7: command-r.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: command-r.rope.freq_base f32 = 75000000.000000
llama_model_loader: - kv 9: command-r.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 30
llama_model_loader: - kv 11: command-r.logit_scale f32 = 0.833333
llama_model_loader: - kv 12: command-r.rope.scaling.type str = none
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "",
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1,
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,253333] = ["ト ト", "ト t", "e r", "i n", "ト
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 5
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 255001
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - kv 23: tokenizer.chat_template.tool_use str = {{ bos_token }}{% if messages[0]['ro
llama_model_loader: - kv 24: tokenizer.chat_template.rag str = {{ bos_token }}{% if messages[0]['ro
llama_model_loader: - kv 25: tokenizer.chat_templates arr[str,2] = ["rag", "tool_use"]
llama_model_loader: - kv 26: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['ro
llama_model_loader: - type f32: 193 tensors
llama_model_loader: - type q5_K: 64 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_xs: 384 tensors
llm_load_vocab: special tokens definition check successful ( 1008/256000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = command-r
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 256000
llm_load_print_meta: n_merges = 253333
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 12288
llm_load_print_meta: n_head = 96
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 12
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 8.3e-01
llm_load_print_meta: n_ff = 33792
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = none
llm_load_print_meta: freq_base_train = 75000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = IQ4_XS - 4.25 bpw
llm_load_print_meta: model params = 103.81 B
llm_load_print_meta: model size = 56190287872.00 Bytes (4.33 BPW)
llm_load_print_meta: model size = 54873328.00 KiB (4.33 BPW)
llm_load_print_meta: model size = 53587.23 MiB (4.33 BPW)
llm_load_print_meta: model size = 52.33 GiB (4.33 BPW)
llm_load_print_meta: model size = 56190287.87 KB (4.33 BPW)
llm_load_print_meta: model size = 56190.29 MB (4.33 BPW)
llm_load_print_meta: model size = 56.19 GB (4.33 BPW)
llm_load_print_meta: general.name = c4ai-command-r-plus
llm_load_print_meta: BOS token = 5 ''
llm_load_print_meta: EOS token = 255001 '<|END_OF_TURN_TOKEN|>'
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token = 136 'テ・
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.74 MiB
llm_load_tensors: offloading 9 repeating layers to GPU
llm_load_tensors: offloaded 9/65 layers to GPU
llm_load_tensors: CPU buffer size = 53587.23 MiB
llm_load_tensors: CUDA0 buffer size = 7189.63 MiB
..............................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 65536
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 75000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 14080.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 2304.00 MiB
llama_new_context_with_model: KV self size = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 12768.05 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 152.01 MiB
llama_new_context_with_model: graph nodes = 2312
llama_new_context_with_model: graph splits = 719
Load Text Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/

Please connect to custom endpoint at http://localhost:5001
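As a side note on the memory figures above, the reported "KV self size = 16384.00 MiB" follows directly from the logged dimensions, assuming the default f16 cache. A rough sketch of the arithmetic, not koboldcpp code:

# Rough check of the "KV self size = 16384.00 MiB" line above,
# assuming an f16 KV cache (2 bytes per element).
n_ctx     = 65536  # llama_new_context_with_model: n_ctx
n_layer   = 64     # llm_load_print_meta: n_layer
n_embd_kv = 1024   # llm_load_print_meta: n_embd_k_gqa / n_embd_v_gqa
kv_bytes  = 2 * n_ctx * n_layer * n_embd_kv * 2  # K and V caches, 2 bytes each
print(kv_bytes / 1024**2, "MiB")  # -> 16384.0 MiB

So at 65k context the cache alone adds 16 GiB on top of the roughly 52 GiB model, which is consistent with the high RAM usage mentioned in the first post.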

Input: {"n": 1, "max_context_length": 65536, "max_length": 512, "rep_pen": 1, "temperature": 1, "top_p": 1, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 0, "rep_pen_slope": 0, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": "", "genkey": "KCPP8699", "min_p": 0.05, "dynatemp_range": 0, "dynatemp_exponent": 10, "smoothing_factor": 0.2, "presence_penalty": 0, "logit_bias": {}, "prompt": "<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>What is a kobold?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>", "quiet": true, "stop_sequence": ["<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>", "<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"], "use_default_badwordsids": false}

Processing Prompt (13 / 13 tokens)
Generating (31 / 512 tokens) (EOS token triggered!)
CtxLimit: 44/65536, Process:6.97s (536.5ms/T = 1.86T/s), Generate:74.35s (2398.5ms/T = 0.42T/s), Total:81.33s (0.38T/s)
Output: In Germanic folklore, a kobold is a household spirit often associated with iron or mining. They can either help or hinder miners in their work.
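In case it helps with reproducing this outside of SillyTavern, a request like the logged Input above can be replayed directly against the local instance. A minimal sketch, assuming the standard KoboldAI-compatible /api/v1/generate route and the Python requests package:

import requests

# Minimal sketch: replay a generate request against the local KoboldCpp instance.
# The payload mirrors a subset of the "Input:" line above; extend it as needed.
payload = {
    "max_context_length": 65536,
    "max_length": 512,
    "temperature": 1,
    "min_p": 0.05,
    "prompt": "<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>"
              "What is a kobold?"
              "<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>",
}

r = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=600)
r.raise_for_status()
print(r.json()["results"][0]["text"])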

SabinStargem commented 2 months ago

v1.64.1 of Kobold has fixed this issue. My RP is up to 57k.

Whatever you did, it is good. :)