ggerganov / llama.cpp

LLM inference in C/C++
MIT License

"main : failed to eval" when LLM produces a long output #4326

Closed: l29ah closed this issue 6 months ago

l29ah commented 10 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Behavior

I've downloaded https://huggingface.co/TheBloke/OpenHermes-2.5-neural-chat-7B-v3-1-7B-GGUF/resolve/main/openhermes-2.5-neural-chat-7b-v3-1-7b.Q5_K_M.gguf?download=true and played with it. Every time it produced a long reply for me, it abruptly stopped with "main : failed to eval" and the process exited.

Environment and Context

Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           39 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  8
On-line CPU(s) list:     0-7
Vendor ID:               GenuineIntel
Model name:              Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
CPU family:              6
Model:                   142
Thread(s) per core:      2
Core(s) per socket:      4
Socket(s):               1
Stepping:                10
CPU(s) scaling MHz:      77%
CPU max MHz:             4000.0000
CPU min MHz:             400.0000
BogoMIPS:                4001.60
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    8 MiB (1 instance)
Vulnerabilities:
  Gather data sampling:  Vulnerable: No microcode
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; IBRS
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Not affected

Linux l29ah-x201 6.6.0-dirty #244 SMP PREEMPT_DYNAMIC Mon Nov 6 00:44:33 CET 2023 x86_64 Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz GenuineIntel GNU/Linux

GNU Make 4.4.1 Built for x86_64-pc-linux-gnu Copyright (C) 1988-2023 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later https://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

clang version 17.0.6 Target: x86_64-pc-linux-gnu Thread model: posix InstalledDir: /usr/lib/llvm/17/bin Configuration file: /etc/clang/x86_64-pc-linux-gnu-clang.cfg

Steps to Reproduce

./main -m ./models/openhermes-2.5-neural-chat-7b-v3-1-7b/ggml-model-q5_k_m.gguf -n -1 -t 4 --color -f ./prompts/chat-with-doctor.txt --prompt-cache ./models/openhermes-2.5-neural-chat-7b-v3-1-7b/ggml-model-q5_k_m_chat-with-doctor.txt.prompt --chatml --n-predict -2

Failure Logs

Log start
main: build = 1604 (33e171d)
main: built with clang version 17.0.6 for x86_64-pc-linux-gnu
main: seed  = 1701704904
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from ./models/openhermes-2.5-neural-chat-7b-v3-1-7b/ggml-model-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    8:              blk.0.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    9:              blk.0.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   10:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   11:            blk.1.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   12:            blk.1.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   13:              blk.1.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   16:         blk.1.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   17:              blk.1.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   18:              blk.1.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   19:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   20:           blk.10.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   21:           blk.10.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   22:             blk.10.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   23:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   24:             blk.10.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   25:        blk.10.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   26:             blk.10.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   27:             blk.10.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   28:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   29:           blk.11.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   30:           blk.11.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   31:             blk.11.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   32:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   33:             blk.11.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   34:        blk.11.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   35:             blk.11.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   36:             blk.11.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   37:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   38:           blk.12.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   39:           blk.12.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   40:             blk.12.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   41:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   42:             blk.12.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   43:        blk.12.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   44:             blk.12.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   45:             blk.12.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   46:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   47:           blk.13.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   48:           blk.13.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   49:             blk.13.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   50:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   51:             blk.13.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   52:        blk.13.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   53:             blk.13.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   54:             blk.13.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   55:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   56:           blk.14.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   57:           blk.14.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   58:             blk.14.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   59:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   60:             blk.14.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   61:        blk.14.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   62:             blk.14.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   63:             blk.14.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   64:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   65:           blk.15.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   66:           blk.15.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   67:             blk.15.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   68:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   69:             blk.15.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   70:        blk.15.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   71:             blk.15.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   72:             blk.15.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   73:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   74:           blk.16.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   75:           blk.16.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   76:             blk.16.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   77:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   78:             blk.16.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   79:        blk.16.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   80:             blk.16.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   81:             blk.16.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   82:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   83:           blk.17.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   84:           blk.17.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   85:             blk.17.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   86:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   87:             blk.17.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   88:        blk.17.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   89:             blk.17.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   90:             blk.17.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   91:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   92:           blk.18.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor   93:           blk.18.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   94:             blk.18.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor   95:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   96:             blk.18.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor   97:        blk.18.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   98:             blk.18.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   99:             blk.18.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  100:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  101:           blk.19.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  102:           blk.19.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  103:             blk.19.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  104:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  105:             blk.19.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  106:        blk.19.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  107:             blk.19.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  108:             blk.19.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  109:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  110:            blk.2.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  111:            blk.2.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  112:              blk.2.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  113:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  114:              blk.2.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  115:         blk.2.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  116:              blk.2.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  117:              blk.2.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  118:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  119:           blk.20.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  120:           blk.20.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  121:             blk.20.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  122:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  123:             blk.20.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  124:        blk.20.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  125:             blk.20.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  126:             blk.20.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  127:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  128:           blk.21.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  129:           blk.21.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  130:             blk.21.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  131:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  132:             blk.21.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  133:        blk.21.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  134:             blk.21.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  135:             blk.21.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  136:             blk.22.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  137:        blk.22.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  138:             blk.22.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  139:             blk.22.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  140:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  141:            blk.3.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  142:            blk.3.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  143:              blk.3.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  144:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  145:              blk.3.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  146:         blk.3.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  147:              blk.3.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  148:              blk.3.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  149:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  150:            blk.4.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  151:            blk.4.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  152:              blk.4.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  153:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  154:              blk.4.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  155:         blk.4.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  156:              blk.4.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  157:              blk.4.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  158:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  159:            blk.5.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  160:            blk.5.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  161:              blk.5.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  162:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  163:              blk.5.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  164:         blk.5.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  165:              blk.5.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  166:              blk.5.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  167:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  168:            blk.6.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  169:            blk.6.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  170:              blk.6.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  171:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  172:              blk.6.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  173:         blk.6.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  174:              blk.6.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  175:              blk.6.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  176:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  177:            blk.7.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  178:            blk.7.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  179:              blk.7.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  180:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  181:              blk.7.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  182:         blk.7.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  183:              blk.7.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  184:              blk.7.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  185:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  186:            blk.8.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  187:            blk.8.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  188:              blk.8.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  189:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  190:              blk.8.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  191:         blk.8.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  192:              blk.8.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  193:              blk.8.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  194:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  195:            blk.9.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  196:            blk.9.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  197:              blk.9.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  198:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  199:              blk.9.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  200:         blk.9.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  201:              blk.9.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  202:              blk.9.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  203:                    output.weight q6_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor  204:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  205:           blk.22.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  206:           blk.22.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  207:             blk.22.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  208:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  209:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  210:           blk.23.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  211:           blk.23.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  212:             blk.23.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  213:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  214:             blk.23.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  215:        blk.23.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  216:             blk.23.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  217:             blk.23.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  218:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  219:           blk.24.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  220:           blk.24.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  221:             blk.24.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  222:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  223:             blk.24.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  224:        blk.24.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  225:             blk.24.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  226:             blk.24.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  227:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  228:           blk.25.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  229:           blk.25.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  230:             blk.25.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  231:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  232:             blk.25.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  233:        blk.25.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  234:             blk.25.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  235:             blk.25.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  236:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  237:           blk.26.ffn_down.weight q5_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  238:           blk.26.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  239:             blk.26.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  240:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  241:             blk.26.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  242:        blk.26.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  243:             blk.26.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  244:             blk.26.attn_v.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  245:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  246:           blk.27.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  247:           blk.27.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  248:             blk.27.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  249:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  250:             blk.27.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  251:        blk.27.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  252:             blk.27.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  253:             blk.27.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  254:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  255:           blk.28.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  256:           blk.28.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  257:             blk.28.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  258:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  259:             blk.28.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  260:        blk.28.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  261:             blk.28.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  262:             blk.28.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  263:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  264:           blk.29.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  265:           blk.29.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  266:             blk.29.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  267:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  268:             blk.29.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  269:        blk.29.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  270:             blk.29.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  271:             blk.29.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  272:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  273:           blk.30.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  274:           blk.30.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  275:             blk.30.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  276:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  277:             blk.30.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  278:        blk.30.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  279:             blk.30.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  280:             blk.30.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  281:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  282:           blk.31.ffn_down.weight q6_K     [ 14336,  4096,     1,     1 ]
llama_model_loader: - tensor  283:           blk.31.ffn_gate.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  284:             blk.31.ffn_up.weight q5_K     [  4096, 14336,     1,     1 ]
llama_model_loader: - tensor  285:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  286:             blk.31.attn_k.weight q5_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  287:        blk.31.attn_output.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  288:             blk.31.attn_q.weight q5_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  289:             blk.31.attn_v.weight q6_K     [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  290:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = weyaxi_openhermes-2.5-neural-chat-7b-...
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q5_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.78 GiB (5.67 BPW) 
llm_load_print_meta: general.name     = weyaxi_openhermes-2.5-neural-chat-7b-v3-1-7b
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: mem required  = 4893.10 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =   64.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.07 MiB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
main: attempting to load saved session from './models/openhermes-2.5-neural-chat-7b-v3-1-7b/ggml-model-q5_k_m_chat-with-doctor.txt.prompt'
main: loaded a session with prompt size of 152 tokens
main: session file has exact match for prompt!
main: interactive mode on.
Reverse prompt: '<|im_start|>user
'
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = -2, n_keep = 101

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 <|im_start|>system
A transcript of a conversation between a curious patient (user) and an extremely skilled and knowledgeable general practitioner of medicine with over 30 years of practice (assistant). The patient (user) is having a medical appointment with the healthcare professional doctor (assistant). The doctor (assistant) gives helpful, detailed, and precise answers to patient (user)'s questions and asks questions in unclear cases.<|im_end|>
> How to treat allergy?
Allergy treatment typically involves a combination of avoiding triggers, managing symptoms, and sometimes using medications. Here are some steps you can take:

1. Identify your allergens: Determine what triggers your allergy by keeping track of when symptoms occur or consulting with an allergist for testing. Common allergens include pollen, dust mites, pet dander, mold spores, and certain foods.

2. Avoidance: Stay away from known allergens as much as possible to minimize exposure. This may involve using hypoallergenic bedding, washing sheets regularly, limiting outdoor activities during high pollen times, or removing pets from your home if they cause an allergy flare-up.

3. Medications: Over-the-counter antihistamines can help with mild allergies by reducing itching, sneezing, and runny nose. Decongestants may also alleviate nasal congestion. If these do not provide relief or if symptoms are severe, consult your doctor for prescription medications like corticosteroid nasal sprays, eye drops, or allergy shots (immunotherapy).

4. Allergy shots: Immunotherapy involves injecting small amounts of allergens under the skin to gradually build tolerance. This treatment is most suitable for people with persistent and severe allergies who don't respond well to other treatments.

5. Other interventions: For individuals with asthma or recurrent sinus infections triggered by allergies, inhalers, nasal steroids, or antibiotics may be prescribed.

6. Self-care measures: To alleviate symptoms, ensure you get proper sleep, eat a well-balanced diet, stay hydrated, and exercise regularly to
main : failed to eval
cmp-nct commented 10 months ago

Change context size -c 2048 (or more)

That should solve your reported problem. However, I'm not sure the model is 100% supported; maybe someone else knows. In the config I see no rope scaling but a 4k sliding attention window, and I don't think that is built into llama.cpp (yet).

I'm also not 100% sure what sliding attention really does; at first glance it's just a 4k context limit on all tokens, but I must be missing something. In that case, anything beyond -c 4096 is unlikely to work well.
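For reference, the reporter's original invocation with the suggested context size would look like this (only -c 2048 added; untested on my end):

./main -m ./models/openhermes-2.5-neural-chat-7b-v3-1-7b/ggml-model-q5_k_m.gguf -c 2048 -n -1 -t 4 --color -f ./prompts/chat-with-doctor.txt --prompt-cache ./models/openhermes-2.5-neural-chat-7b-v3-1-7b/ggml-model-q5_k_m_chat-with-doctor.txt.prompt --chatml --n-predict -2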

l29ah commented 10 months ago

-c 2048 results in this straight away:

main: attempting to load saved session from './models/openhermes-2.5-neural-chat-7b-v3-1-7b/ggml-model-q5_k_m_chat-with-doctor.txt.prompt'
GGML_ASSERT: llama.cpp:9659: kv_self.buf.size == kv_buf_size
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007ff38a0fc717 in wait4 () from /lib64/libc.so.6
#0  0x00007ff38a0fc717 in wait4 () from /lib64/libc.so.6
#1  0x0000557a43319144 in ggml_print_backtrace ()
#2  0x0000557a4337c41e in llama_set_state_data ()
#3  0x0000557a4337c777 in llama_load_session_file ()
#4  0x0000557a4330eee5 in main ()

Removing the cached prompt seems to solve it. Thanks!

rhvall commented 9 months ago

I think this issue is present in develop at commit b9f47952ffae4e with the following command:

build/bin/main -m model/mistral-7b-instruct-v0.2.Q3_K_S.gguf -ngl 1 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt

By default, n_ctx is 512. Somehow, when the first prompt is larger than that value, it fails to evaluate it with the reduced context:

Log start
main: build = 1699 (b9f4795)
main: built with Apple clang version 14.0.0 (clang-1400.0.29.202) for arm64-apple-darwin21.6.0
main: seed  = 1703569953
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from model/mistral-7b-instruct-v0.2.Q3_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 11
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q3_K:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q3_K - Small
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 2.95 GiB (3.50 BPW) 
llm_load_print_meta: general.name     = mistralai_mistral-7b-instruct-v0.2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size =  3017.97 MiB, ( 3018.03 / 49152.00)
llm_load_tensors: system memory used  = 3017.38 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading 'llamaCPP/build/bin/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    64.00 MiB, ( 3082.66 / 49152.00)
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, ( 3082.67 / 49152.00)
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 76.19 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    73.02 MiB, ( 3155.67 / 49152.00)

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
main: interactive mode on.
Reverse prompt: 'User:'
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:Describe in one sentence the following text: "In 2014 Apple introduced Swift. It’s a multi-paradigm, compiled, statically and strongly-typed language. Swift has powerful and rich value types that support methods, implementing protocols(interfaces), extensions etc. Even though Apple recommended using value types since then, the milestones in the brief history of Swift were actually the following two sessions of WWDC 2015. In these sessions, Apple strongly advised using value types more often. Value semantics serve to eliminate mutation and remove unintended sharing of state and related side effects. By providing powerful value types, Swift aims to maximize value type usage to avoid possible errors related to sharing the state. Meantime value types provide better performance metrics than reference types. Therewithal it was Swift taking a pass at the functional programming community since the latter aims the same goals, even in today’s increasingly concurrent world. Functional programming also depends on the paradigm 'thinking in functional style'. In functional programming world, there is no country for shared state, mutating state and related side effects. That said it has its disadvantages as well, such as its inability to fit perfectly to machine model or its inefficiency in cases where the mutation is a good choice. This is a huge topic that goes beyond the scope of this article. So I will not go deeper into this heated debate and leave it here for now. Here is a quick refresher for value semantics and the features of value types presented in aforementioned WWDC sessions. No Shared State (Auto Copying) & Immutability. Mutating an instance will never affect another.Instances of value types are created in the stack and on each assignment or passing the value around (between functions or threads) there will be a unique instance(if compiler is not sure there will be no mutation, a new copy) and it will be passed. Therefore, you are guaranteed with no shared state. And it’s not possible to mutate an instance unintendedly. As you see on assignment our struct is automatically copied and this copy is mutated. So value types don’t have shared state, and they have an auto-copying feature. Swift’s collection types (Dictionary, Array, String, Set etc.) are value types that are backed by reference types. In these types copy-on-write performance optimization implemented by default in order to avoid mutation issues. Basically copy-on-write provides creating another instance only when the first instance is mutated. Otherwise, a single instance is shared among the variables. So collections are safe for mutability. Here is a quick refresher for value semantics and the features of value types presented in aforementioned WWDC sessions. No Shared State (Auto Copying) & Immutability Mutating an instance will never affect another. Instances of value types are created in the stack and on each assignment or passing the value around (between functions or threads) there will be a unique instance(if compiler is not sure there will be no mutation, a new copy) and it will be passed. Therefore, you are guaranteed with no shared state. And it’s not possible to mutate an instance unintendedly.
<<input too long: skipped 4 tokens>>
main : failed to eval

As shown at the end of the log, it prints main : failed to eval. I understand that using a larger context (e.g. -c 1024) would resolve this issue. On the other hand, if the KV-cache changes are working correctly, it should summarize the text without trouble, right?

rhvall commented 9 months ago

A similar case occurs with the lookup example: using a smaller context than the text to summarize fails to execute, even if the restriction on max_tokens_list_size is removed.

For that, I used this command: lookup -m "model/mistral-7b-instruct-v0.2.Q3_K_S.gguf" -f summary.txt --temp 0.0 -c 1024 -b 1024 -n -1 --no-penalize-nl --verbose-prompt --draft 10

Compiled using the same dev commit (b9f47952ffae4e) and the summary.txt file here

I think the model should work by shifting the context to fit as much as possible without failing, even if the output is degraded compared to using a larger context.

rhvall commented 9 months ago

@ggerganov apologies for the tag; nonetheless, I think this issue is still present in release "b1874". Are there any suggestions on how to address it? Thanks in advance.

ggerganov commented 9 months ago

This is expected - the prompt cannot be larger than the KV cache size (i.e. the -c argument)
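
For reference, the message itself comes from the batch-eval loop in examples/main/main.cpp, which bails out when llama_decode cannot place the batch in the KV cache. Roughly this shape (a paraphrase, not the exact source):

// simplified sketch of the eval loop in examples/main/main.cpp
for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
    int n_eval = (int) embd.size() - i;
    if (n_eval > params.n_batch) {
        n_eval = params.n_batch;
    }
    // llama_decode returns non-zero when the batch cannot fit in the KV cache
    if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval, n_past, 0))) {
        LOG_TEE("%s : failed to eval\n", __func__);
        return 1;
    }
    n_past += n_eval;
}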

rhvall commented 9 months ago

This is expected - the prompt cannot be larger than the KV cache size (i.e. the -c argument)

Thanks for your reply.

I understand that the data passed to the model needs to be at most --ctx-size tokens. On the other hand, how are models able to process data beyond their context size?

In the main.cpp example, there are KV-cache functions that remove some elements and shift others around, like this:

llama_kv_cache_seq_rm   (ctx, 0, params.n_keep + 1            , params.n_keep + n_discard + 1);
llama_kv_cache_seq_shift(ctx, 0, params.n_keep + 1 + n_discard, n_past, -n_discard);

Can't this process be used to handle large prompts in chunks of --ctx-size?
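
For context, my (possibly inaccurate) reading of how main.cpp wires these two calls into the generation loop is roughly:

// sketch of main.cpp's context shift during generation (my paraphrase)
if (n_past + (int) embd.size() > n_ctx) {
    const int n_left    = n_past - params.n_keep - 1;
    const int n_discard = n_left / 2;

    // drop the oldest half of the tokens after the kept prefix ...
    llama_kv_cache_seq_rm   (ctx, 0, params.n_keep + 1, params.n_keep + n_discard + 1);
    // ... then slide the remaining entries back so new tokens fit
    llama_kv_cache_seq_shift(ctx, 0, params.n_keep + 1 + n_discard, n_past, -n_discard);

    n_past -= n_discard;
}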

ggerganov commented 9 months ago

It can, but currently it is implemented only for the text-generation phase. For prompt processing it looks like a less common use case, which is why it is not available in the examples. Also, if you are going to be discarding the prompt, why even try to feed it in the first place?

rhvall commented 9 months ago

text-generation

I see, maybe it is a misunderstanding on my part. I thought that by providing the prompt in batches and then shifting the kv_cache sequence, the model would be able to interpret the whole text provided. A particular application I was thinking of was text summarization, where the input can easily exceed the number of tokens in the context.

Thanks for your reply

ggerganov commented 9 months ago

Generally a model cannot operate with more tokens than its training context. There are techniques that try to overcome this limitation, such as rope scaling and KV-cache massaging, but these are more advanced use cases and don't work universally.

The context-shift technique implemented for text generation in main is a basic way of discarding the oldest tokens in order to "free" some space for new tokens. However, the model will "forget" about the discarded tokens.
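
For scale (using the numbers from the first log above, n_ctx = 512 and n_keep = 101, and the halving rule in the sketch earlier in this thread): once the cache is full at n_past = 512, n_left = 512 - 101 - 1 = 410 and n_discard = 410 / 2 = 205, so roughly 205 of the oldest non-kept tokens are dropped on each shift.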

You can keep an eye on the "self-extend" approach, as it seems promising, but currently the support in llama.cpp is pretty rudimentary. It should improve with time.

rhvall commented 9 months ago

Certainly, I will follow the changes your team makes to the library, which is amazing by the way. If I find a way to contribute back, I will try my best.

Thanks for your help

github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.