ggerganov / llama.cpp


server: system prompt makes generated text incoherent #4103

Status: Closed (closed by z80maniac 6 months ago)

z80maniac commented 11 months ago

Current Behavior

Passing a system prompt to the server makes the generated text incoherent after the first request.

Environment and Context

Commit: 8da46278e1a57107591653275f8e03a281de94f0

OS: Kubuntu 23.10

❯ lscpu | grep -P 'Model name|Flags'
Model name: AMD Ryzen 9 7900 12-Core Processor
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
❯ uname -a
Linux comp 6.5.0-10-generic #10-Ubuntu SMP PREEMPT_DYNAMIC Fri Oct 13 13:49:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
❯ make --version | head -1
GNU Make 4.3
❯ g++ --version | head -1
g++ (Ubuntu 13.2.0-4ubuntu3) 13.2.0

Steps to Reproduce

  1. I used this model: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/blob/main/llama-2-13b-chat.Q4_K_M.gguf

  2. The server is built with plain `make`, with no other parameters.

  3. Start the server:

    ./server -m /opt/models/text/llama-2-13b-chat.Q4_K_M.gguf
startup log ``` {"timestamp":1700156091,"level":"INFO","function":"main","line":2268,"message":"build info","build":1519,"commit":"8da4627"} {"timestamp":1700156091,"level":"INFO","function":"main","line":2271,"message":"system info","n_threads":12,"n_threads_batch":-1,"total_threads":24,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "} llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /opt/models/text/llama-2-13b-chat.Q4_K_M.gguf (version GGUF V2) llama_model_loader: - tensor 0: token_embd.weight q4_K [ 5120, 32000, 1, 1 ] llama_model_loader: - tensor 1: blk.0.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 2: blk.0.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 3: blk.0.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 4: blk.0.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 5: blk.0.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 6: blk.0.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 7: blk.0.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 8: blk.0.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 9: blk.0.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 10: blk.1.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 11: blk.1.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 12: blk.1.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 13: blk.1.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 14: blk.1.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 15: blk.1.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 16: blk.1.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] 
llama_model_loader: - tensor 17: blk.1.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 18: blk.1.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 19: blk.10.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 20: blk.10.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 21: blk.10.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 22: blk.10.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 23: blk.10.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 24: blk.10.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 25: blk.10.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 26: blk.10.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 27: blk.10.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 28: blk.11.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 29: blk.11.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 30: blk.11.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 31: blk.11.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 32: blk.11.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 33: blk.11.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 34: blk.11.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 35: blk.11.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 36: blk.11.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 37: blk.12.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 38: blk.12.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 39: blk.12.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 40: blk.12.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 41: blk.12.ffn_norm.weight 
f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 42: blk.12.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 43: blk.12.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 44: blk.12.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 45: blk.12.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 46: blk.13.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 47: blk.13.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 48: blk.13.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 49: blk.13.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 50: blk.13.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 51: blk.13.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 52: blk.13.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 53: blk.13.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 54: blk.13.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 55: blk.14.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 56: blk.14.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 57: blk.14.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 58: blk.14.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 59: blk.14.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 60: blk.14.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 61: blk.14.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 62: blk.14.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 63: blk.14.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 64: blk.15.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 65: blk.15.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 66: 
blk.2.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 67: blk.2.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 68: blk.2.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 69: blk.2.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 70: blk.2.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 71: blk.2.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 72: blk.2.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 73: blk.2.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 74: blk.2.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 75: blk.3.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 76: blk.3.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 77: blk.3.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 78: blk.3.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 79: blk.3.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 80: blk.3.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 81: blk.3.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 82: blk.3.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 83: blk.3.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 84: blk.4.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 85: blk.4.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 86: blk.4.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 87: blk.4.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 88: blk.4.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 89: blk.4.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 90: blk.4.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 
91: blk.4.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 92: blk.4.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 93: blk.5.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 94: blk.5.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 95: blk.5.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 96: blk.5.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 97: blk.5.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 98: blk.5.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 99: blk.5.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 100: blk.5.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 101: blk.5.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 102: blk.6.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 103: blk.6.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 104: blk.6.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 105: blk.6.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 106: blk.6.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 107: blk.6.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 108: blk.6.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 109: blk.6.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 110: blk.6.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 111: blk.7.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 112: blk.7.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 113: blk.7.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 114: blk.7.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 115: blk.7.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] 
llama_model_loader: - tensor 116: blk.7.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 117: blk.7.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 118: blk.7.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 119: blk.7.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 120: blk.8.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 121: blk.8.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 122: blk.8.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 123: blk.8.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 124: blk.8.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 125: blk.8.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 126: blk.8.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 127: blk.8.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 128: blk.8.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 129: blk.9.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 130: blk.9.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 131: blk.9.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 132: blk.9.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 133: blk.9.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 134: blk.9.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 135: blk.9.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 136: blk.9.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 137: blk.9.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 138: blk.15.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 139: blk.15.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 140: 
blk.15.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 141: blk.15.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 142: blk.15.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 143: blk.15.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 144: blk.15.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 145: blk.16.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 146: blk.16.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 147: blk.16.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 148: blk.16.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 149: blk.16.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 150: blk.16.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 151: blk.16.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 152: blk.16.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 153: blk.16.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 154: blk.17.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 155: blk.17.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 156: blk.17.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 157: blk.17.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 158: blk.17.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 159: blk.17.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 160: blk.17.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 161: blk.17.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 162: blk.17.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 163: blk.18.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 164: blk.18.ffn_down.weight q4_K [ 
13824, 5120, 1, 1 ] llama_model_loader: - tensor 165: blk.18.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 166: blk.18.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 167: blk.18.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 168: blk.18.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 169: blk.18.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 170: blk.18.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 171: blk.18.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 172: blk.19.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 173: blk.19.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 174: blk.19.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 175: blk.19.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 176: blk.19.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 177: blk.19.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 178: blk.19.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 179: blk.19.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 180: blk.19.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 181: blk.20.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 182: blk.20.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 183: blk.20.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 184: blk.20.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 185: blk.20.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 186: blk.20.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 187: blk.20.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 188: blk.20.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] 
llama_model_loader: - tensor 189: blk.20.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 190: blk.21.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 191: blk.21.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 192: blk.21.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 193: blk.21.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 194: blk.21.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 195: blk.21.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 196: blk.21.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 197: blk.21.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 198: blk.21.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 199: blk.22.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 200: blk.22.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 201: blk.22.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 202: blk.22.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 203: blk.22.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 204: blk.22.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 205: blk.22.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 206: blk.22.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 207: blk.22.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 208: blk.23.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 209: blk.23.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 210: blk.23.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 211: blk.23.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 212: blk.23.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 213: 
blk.23.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 214: blk.23.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 215: blk.23.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 216: blk.23.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 217: blk.24.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 218: blk.24.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 219: blk.24.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 220: blk.24.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 221: blk.24.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 222: blk.24.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 223: blk.24.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 224: blk.24.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 225: blk.24.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 226: blk.25.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 227: blk.25.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 228: blk.25.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 229: blk.25.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 230: blk.25.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 231: blk.25.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 232: blk.25.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 233: blk.25.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 234: blk.25.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 235: blk.26.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 236: blk.26.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 237: blk.26.ffn_gate.weight q4_K [ 
5120, 13824, 1, 1 ] llama_model_loader: - tensor 238: blk.26.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 239: blk.26.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 240: blk.26.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 241: blk.26.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 242: blk.26.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 243: blk.26.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 244: blk.27.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 245: blk.27.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 246: blk.27.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 247: blk.27.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 248: blk.27.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 249: blk.27.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 250: blk.27.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 251: blk.27.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 252: blk.27.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 253: blk.28.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 254: blk.28.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 255: blk.28.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 256: blk.28.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 257: blk.28.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 258: blk.28.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 259: blk.28.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 260: blk.28.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 261: blk.28.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] 
llama_model_loader: - tensor 262: blk.29.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 263: blk.29.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 264: blk.29.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 265: blk.29.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 266: blk.29.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 267: blk.29.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 268: blk.29.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 269: blk.29.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 270: blk.29.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 271: blk.30.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 272: blk.30.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 273: blk.30.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 274: blk.30.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 275: blk.30.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 276: blk.30.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 277: output.weight q6_K [ 5120, 32000, 1, 1 ] llama_model_loader: - tensor 278: blk.30.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 279: blk.30.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 280: blk.30.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 281: blk.31.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 282: blk.31.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 283: blk.31.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 284: blk.31.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 285: blk.31.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 286: 
blk.31.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 287: blk.31.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 288: blk.31.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 289: blk.31.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 290: blk.32.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 291: blk.32.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 292: blk.32.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 293: blk.32.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 294: blk.32.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 295: blk.32.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 296: blk.32.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 297: blk.32.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 298: blk.32.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 299: blk.33.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 300: blk.33.ffn_down.weight q4_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 301: blk.33.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 302: blk.33.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 303: blk.33.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 304: blk.33.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 305: blk.33.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 306: blk.33.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 307: blk.33.attn_v.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 308: blk.34.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 309: blk.34.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 310: blk.34.ffn_gate.weight q4_K [ 
5120, 13824, 1, 1 ] llama_model_loader: - tensor 311: blk.34.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 312: blk.34.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 313: blk.34.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 314: blk.34.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 315: blk.34.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 316: blk.34.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 317: blk.35.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 318: blk.35.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 319: blk.35.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 320: blk.35.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 321: blk.35.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 322: blk.35.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 323: blk.35.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 324: blk.35.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 325: blk.35.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 326: blk.36.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 327: blk.36.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 328: blk.36.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 329: blk.36.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 330: blk.36.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 331: blk.36.attn_k.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 332: blk.36.attn_output.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 333: blk.36.attn_q.weight q4_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 334: blk.36.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] 
llama_model_loader: - tensor 335: blk.37.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 336: blk.37.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 337: blk.37.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 338: blk.37.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 339: blk.37.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 340: blk.37.attn_k.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 341: blk.37.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 342: blk.37.attn_q.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 343: blk.37.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 344: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 345: blk.38.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 346: blk.38.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 347: blk.38.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 348: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 349: blk.38.attn_k.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 350: blk.38.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 351: blk.38.attn_q.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 352: blk.38.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 353: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 354: blk.39.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 355: blk.39.ffn_gate.weight q4_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 356: blk.39.ffn_up.weight q4_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 357: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 358: blk.39.attn_k.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 359: blk.39.attn_output.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 360: blk.39.attn_q.weight q4_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 361: blk.39.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 362: output_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: general.file_type u32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_K: 241 tensors
llama_model_loader: - type q6_K: 41 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 7.33 GiB (4.83 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.13 MB
llm_load_tensors: mem required = 7500.98 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 400.00 MB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 76.57 MB
Available slots:
 -> Slot 0 - max context: 512
llama server listening at http://127.0.0.1:8080
{"timestamp":1700156092,"level":"INFO","function":"main","line":2548,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
```
  1. Call the API and set the system prompt in it, e.g.:
    curl -sS -H 'Content-Type: application/json' --data '{"n_predict":8, "prompt":"When he looked in the mirror, he saw that he", "system_prompt": {"prompt": "This is a story about a mysterious man."}, "cache_prompt": true}' http://127.0.0.1:8080/completion | jq .content

Failure Logs

First, these are the results if you never set the system prompt:

" had become a monkey.\n\nHe"
" was still dressed as a woman.\n\""
" had aged. He had grown old and fra"
" was no longer a handsome man. His"
" had grown a goatee.\nWhen"

Seems OK. Now the results with the system prompt:

" was no longer young and handsome. He"
" he he. He was a man of mystery"
". He was a mysterious man with a"
" he\nThis is a story about a myster"
" ha ha ha ha ha ha ha ha."

The first result is always OK, but the rest either end abruptly with a full stop or contain nonsense. It also doesn't matter whether you specify the system prompt in the second and later requests.
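
To make the first-versus-later pattern easier to reproduce, the same request can be sent several times in a loop. This is only a sketch: it assumes the server from the steps above is listening on http://127.0.0.1:8080 and that `jq` is installed. `build_payload` and `repeat_request` are hypothetical helper names, not part of llama.cpp.

```shell
# Sketch only: build_payload and repeat_request are hypothetical helpers.

# Build the request body used in this report.
# $1 = prompt, $2 = system prompt. Both must already be JSON-safe
# (no unescaped quotes or raw newlines), because printf does no escaping.
build_payload() {
  printf '{"n_predict":8,"prompt":"%s","system_prompt":{"prompt":"%s"},"cache_prompt":true}' "$1" "$2"
}

# Send the same body N times and print each completion with its request number.
# $1 = number of requests, $2 = JSON body, $3 = endpoint URL (optional).
repeat_request() {
  rr_n="$1"
  rr_body="$2"
  rr_url="${3:-http://127.0.0.1:8080/completion}"
  rr_i=1
  while [ "$rr_i" -le "$rr_n" ]; do
    printf '%s: ' "$rr_i"
    curl -sS -H 'Content-Type: application/json' --data "$rr_body" "$rr_url" | jq .content
    rr_i=$((rr_i + 1))
  done
}

# Usage (needs the running server):
# repeat_request 5 "$(build_payload 'When he looked in the mirror, he saw that he' 'This is a story about a mysterious man.')"
```

Numbering the requests makes it easy to see whether only the first completion is coherent.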

Also, if you reset the system prompt (set it to an empty string), the output is weird and disconnected as well, and it almost never starts with a space. For example, this query:

curl -sS -H 'Content-Type: application/json' --data '{"n_predict":8, "prompt":"When he looked in the mirror, he saw that he", "system_prompt": {"prompt": ""}, "cache_prompt": true}' http://127.0.0.1:8080/completion | jq .content

Gives these results:

"ated debate about immigration and border security."
"aling process.\n2. Diet: E"
"ist film genre.\nThe story follows a"
"brew alphabet song lyrics and the he"
"ir of God’s promise to Abraham."
KerfuffleV2 commented 11 months ago

It doesn't seem like you're following the LLaMA2-chat instruct format so it's pretty typical to get poor results in that case. Following the model's expected prompt format for instruct tuned models is usually a good idea.

There also may be issues with how server handles the system prompt.

z80maniac commented 11 months ago

It doesn't seem like you're following the LLaMA2-chat instruct format

But why is the first response always fine then? The results start to become incoherent only from the second request. And I also get poor results even if the system prompt is an empty string. Is there a difference between not specifying the system prompt and setting it as an empty string?

KerfuffleV2 commented 11 months ago

But why is the first response always fine then?

Models can be unpredictable when you don't follow the prompt format. Like I said, it may not be the only issue at play. Generally when you don't follow the format the model was trained on, it's not going to perform as well or reliably.

z80maniac commented 11 months ago

Models can be unpredictable when you don't follow the prompt format.

Yes, but not starting from the second request only. The model itself is stateless, so it means that something in the server breaks after the first request.

Here is an example that uses the system prompt from the docs, and I believe it's a correct system prompt for Llama2-chat, judging by the info on the HuggingFace card.

curl -sS -H 'Content-Type: application/json' --data '{"n_predict":16, "system_prompt": { "prompt": "Transcript of a never ending dialog, where the User interacts with an Assistant.\\nThe Assistant is helpful, kind, honest, good at writing, and never fails to answer the User\'s requests immediately and with precision.\\nUser: Recommend a nice restaurant in the area.\\nAssistant: I recommend the restaurant \\"The Golden Duck\\". It is a 5 star restaurant with a great view of the city. The food is delicious and the service is excellent. The prices are reasonable and the portions are generous. The restaurant is located at 123 Main Street, New York, NY 10001. The phone number is (212) 555-1234. The hours are Monday through Friday from 11:00 am to 10:00 pm. The restaurant is closed on Saturdays and Sundays.\\nUser: Who is Richard Feynman?\\nAssistant: Richard Feynman was an American physicist who is best known for his work in quantum mechanics and particle physics. He was awarded the Nobel Prize in Physics in 1965 for his contributions to the development of quantum electrodynamics. He was a popular lecturer and author, and he wrote several books, including \\"Surely You\'re Joking, Mr. Feynman!\\" and \\"What Do You Care What Other People Think?\\".\\nUser:", "anti_prompt": "User:", "assistant_name": "Assistant:"}, "cache_prompt": true, "prompt": "What is the capital of Canada?\\nAssistant:"}' http://127.0.0.1:8080/completion | jq .content

The question was "What is the capital of Canada?". Here are the results:

" The capital of Canada is Ottawa."
" I want to learn more about quantum mechanics. Can you recommend some resources?\n"
" Can you give me some recommendations for books to read?\nAssistant: C"

Again, the first response is OK, but after that it looks like the server ignores the prompt altogether (or something like that).

Seems like the problem is in "cache_prompt": true. When I remove it I get what is expected (almost):

" The capital of Canada is Ottawa."
" The capital of Canada is Ottawa.\nUser: What is the population of New"
" The capital of Canada is Ottawa.\nUser: Can you write a poem about"

Now it gives correct answers, but starting from the second request it does not stop on the anti_prompt. But I guess it's a different issue, and can be mitigated by "stop": ["User:"].
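
A minimal sketch of that workaround: the request body leaves out `cache_prompt` and passes an explicit `stop` array. `build_stop_payload` is a hypothetical helper name, and the `system_prompt` field is omitted here for brevity; it can be added exactly as in the command above.

```shell
# Sketch only: build_stop_payload is a hypothetical helper.

# Build a /completion body without cache_prompt and with one stop string.
# $1 = prompt, $2 = stop string. Both must already be JSON-safe,
# because printf does no escaping.
build_stop_payload() {
  printf '{"n_predict":16,"prompt":"%s","stop":["%s"]}' "$1" "$2"
}

# Usage (needs the running server):
# curl -sS -H 'Content-Type: application/json' \
#   --data "$(build_stop_payload 'What is the capital of Canada?\nAssistant:' 'User:')" \
#   http://127.0.0.1:8080/completion | jq .content
```

Because `%s` arguments are not escape-processed by `printf`, the `\n` in the usage line reaches the server as a JSON newline escape.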

WeirdConstructor commented 11 months ago

There is a slim possibility that this may be connected with #4113 in some way, because the server is also using multiple sequences in the KV cache, and there seems to be something fishy about the order of decoding.

KerfuffleV2 commented 11 months ago

Here is an example that uses the system prompt from the docs, and I believe it's a correct system prompt for Llama2-chat, judging by the info on the HuggingFace card.

Your own link to the model you're using has a suggested prompt format that's completely different from what you're using. So it's not just the system prompt format that's wrong, it's the entirety of your prompt: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF#prompt-template-llama-2-chat

If you think you'll get better results using a format the model wasn't trained to expect, then by all means go ahead and use whatever you want. I'm going to unsubscribe from this one. Hope you're able to solve your problem.

z80maniac commented 11 months ago

Here are the results when I pass that prompt template from the model card as is:

curl -sS -H 'Content-Type: application/json' --data '{"n_predict":16, "system_prompt": { "prompt": "[INST] <<SYS>>\\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don\'t know the answer to a question, please don\'t share false information.\\n<</SYS>>\\n{prompt}[/INST]", "anti_prompt": "User:", "assistant_name": "Assistant:"}, "prompt": "What is the capital of Canada?\\nAssistant:", "cache_prompt": true}' http://127.0.0.1:8080/completion | jq .content

" The capital of Canada is Ottawa."
" As a helpful, respectful, and honest assistant, I will do my best to"
" Sure, I'm here to help! Please go ahead and ask your question."

Even if I add User: to the end of the system prompt or to the start of the prompt, the results are the same. So, no, the contents and formatting of the system prompt do not matter in this case. Something else is wrong with the server.

Okabintaro commented 10 months ago

I have observed similar behavior on the master branch (commit 2568a4bf548d7392e9c78c008b33b4c11d53fe95), which also suggests there might be a problem with cache handling in the server.

When sampling with "temperature": 0, "cache_prompt": true and using the same prompt, the first request results in a sensible output, while all subsequent requests produce the same nonsensical output.
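
With `"temperature": 0` and an identical prompt, one would expect every request to return the same text, so any line that differs from the first points at server-side state rather than sampling. A small sketch of such a check, assuming one completion per line on stdin (as printed by `jq .content`); `compare_to_first` is a hypothetical helper name:

```shell
# Sketch only: compare_to_first is a hypothetical helper.

# Read one completion per line on stdin and report, for each line after
# the first, whether it matches the first one.
compare_to_first() {
  awk 'NR == 1 { first = $0; print "1: reference"; next }
       { print NR ": " ($0 == first ? "same" : "DIFFERENT") }'
}

# Usage (needs the running server; the prompt text is a placeholder):
# for i in 1 2 3; do
#   curl -sS -H 'Content-Type: application/json' \
#     --data '{"n_predict":8,"prompt":"Hello","temperature":0,"cache_prompt":true}' \
#     http://127.0.0.1:8080/completion | jq .content
# done | compare_to_first
```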

Further investigation revealed that the main example with caching enabled works correctly on the CPU. However, it seems to encounter the exact same issue when the model is offloaded to the GPU.

Sorry, I don't have time to provide concrete examples at the moment. I might update this with more details later.

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.