ggerganov / llama.cpp

LLM inference in C/C++
MIT License

ROCm: garbled output with low ngl #2968

Closed Jipok closed 5 months ago

Jipok commented 1 year ago

OS: Void Linux
Kernel: 6.3.13
ROCm: 5.6.0

lscpu ``` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: AuthenticAMD Model name: AMD Ryzen 7 6800H with Radeon Graphics CPU family: 25 Model: 68 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 Stepping: 1 Frequency boost: enabled CPU(s) scaling MHz: 37% CPU max MHz: 4784.3750 CPU min MHz: 1600.0000 BogoMIPS: 6387.93 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx m mxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pcl mulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bp ext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_ll c cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_s cale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm Virtualization features: Virtualization: AMD-V Caches (sum of all): L1d: 256 KiB (8 instances) L1i: 256 KiB (8 instances) L2: 4 MiB (8 instances) L3: 16 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0-15 Vulnerabilities: Itlb multihit: Not affected L1tf: Not affected Mds: Not affected Meltdown: Not affected Mmio stale data: Not affected Retbleed: Not affected Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected Srbds: Not affected Tsx async abort: Not affected ```
/opt/rocm/bin/rocminfo ``` ROCk module is loaded ===================== HSA System Attributes ===================== Runtime Version: 1.1 System Timestamp Freq.: 1000.000000MHz Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) Machine Model: LARGE System Endianness: LITTLE ========== HSA Agents ========== ******* Agent 1 ******* Name: AMD Ryzen 7 6800H with Radeon Graphics Uuid: CPU-XX Marketing Name: AMD Ryzen 7 6800H with Radeon Graphics Vendor Name: CPU Feature: None specified Profile: FULL_PROFILE Float Round Mode: NEAR Max Queue Number: 0(0x0) Queue Min Size: 0(0x0) Queue Max Size: 0(0x0) Queue Type: MULTI Node: 0 Device Type: CPU Cache Info: L1: 32768(0x8000) KB Chip ID: 0(0x0) ASIC Revision: 0(0x0) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 3200 BDFID: 0 Internal Node ID: 0 Compute Unit: 16 SIMDs per CU: 0 Shader Engines: 0 Shader Arrs. per Eng.: 0 WatchPts on Addr. Ranges:1 Features: None Pool Info: Pool 1 Segment: GLOBAL; FLAGS: FINE GRAINED Size: 30545284(0x1d21584) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 2 Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED Size: 30545284(0x1d21584) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE Pool 3 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 30545284(0x1d21584) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: TRUE ISA Info: ******* Agent 2 ******* Name: gfx1035 Uuid: GPU-XX Marketing Name: AMD Radeon Graphics Vendor Name: AMD Feature: KERNEL_DISPATCH Profile: BASE_PROFILE Float Round Mode: NEAR Max Queue Number: 128(0x80) Queue Min Size: 64(0x40) Queue Max Size: 131072(0x20000) Queue Type: MULTI Node: 1 Device Type: GPU Cache Info: L1: 16(0x10) KB L2: 2048(0x800) KB Chip ID: 5761(0x1681) ASIC Revision: 2(0x2) Cacheline Size: 64(0x40) Max Clock Freq. (MHz): 2200 BDFID: 29696 Internal Node ID: 1 Compute Unit: 12 SIMDs per CU: 2 Shader Engines: 1 Shader Arrs. per Eng.: 2 WatchPts on Addr. Ranges:4 Features: KERNEL_DISPATCH Fast F16 Operation: TRUE Wavefront Size: 32(0x20) Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Max Waves Per CU: 32(0x20) Max Work-item Per CU: 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) Max fbarriers/Workgrp: 32 Pool Info: Pool 1 Segment: GLOBAL; FLAGS: COARSE GRAINED Size: 2097152(0x200000) KB Allocatable: TRUE Alloc Granule: 4KB Alloc Alignment: 4KB Accessible by all: FALSE Pool 2 Segment: GROUP Size: 64(0x40) KB Allocatable: FALSE Alloc Granule: 0KB Alloc Alignment: 0KB Accessible by all: FALSE ISA Info: ISA 1 Name: amdgcn-amd-amdhsa--gfx1035 Machine Models: HSA_MACHINE_MODEL_LARGE Profiles: HSA_PROFILE_BASE Default Rounding Mode: NEAR Default Rounding Mode: NEAR Fast f16: TRUE Workgroup Max Size: 1024(0x400) Workgroup Max Size per Dimension: x 1024(0x400) y 1024(0x400) z 1024(0x400) Grid Max Size: 4294967295(0xffffffff) Grid Max Size per Dimension: x 4294967295(0xffffffff) y 4294967295(0xffffffff) z 4294967295(0xffffffff) FBarrier Max Size: 32 *** Done *** ```

Build:

CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake .. -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030
cmake --build .
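
A hedged sanity check of the ROCm setup before running ./main might look like the sketch below. The gfx1035 string comes from the rocminfo dump above; the HIP_VISIBLE_DEVICES value of 0 is only an assumption about the agent order on this machine.

```
# Sketch: verify which ISA the runtime reports and pin the device,
# assuming rocminfo and the HIP runtime live under /opt/rocm.

# 1. List the GPU ISAs rocminfo knows about (should print gfx1035 here).
/opt/rocm/bin/rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u

# 2. Pin HIP to the intended device (index 0 is an assumption; check the
#    agent order in the rocminfo output).
export HIP_VISIBLE_DEVICES=0

# 3. Map the unsupported gfx1035 ISA onto the gfx1030 kernels built above,
#    matching the override used in the run step below.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
```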

Run:

export HSA_OVERRIDE_GFX_VERSION=10.3.0
./main -s 0 --temp 0 -m ~/Downloads/puddlejumper-13b.q8_0.gguf --color -e -i --in-prefix "USER: " --in-suffix "ASSISTANT: " -ngl 1 -p "USER: Who are you?\nASSISTANT:"
Output ``` Log start main: build = 1152 (8b56b4f) main: seed = 0 ggml_init_cublas: found 1 ROCm devices: Device 0: AMD Radeon Graphics, compute capability 10.3 llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from /home/kiv/Downloads/puddlejumper-13b.q8_0.gguf (version GGUF V1 (support until nov 2023)) llama_model_loader: - tensor 0: token_embd.weight q8_0 [ 5120, 32002, 1, 1 ] llama_model_loader: - tensor 1: blk.0.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 2: blk.0.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 3: blk.0.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 4: blk.0.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 6: blk.0.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 7: blk.0.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 10: blk.1.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 11: blk.1.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 12: blk.1.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 13: blk.1.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 14: blk.1.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 15: blk.1.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 16: blk.1.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 19: blk.2.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 20: blk.2.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 21: blk.2.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 22: blk.2.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 23: blk.2.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 24: blk.2.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 25: blk.2.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 28: blk.3.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 29: blk.3.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 30: blk.3.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 31: blk.3.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 32: blk.3.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 33: blk.3.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 34: blk.3.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 37: blk.4.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 38: blk.4.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 39: blk.4.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 40: 
blk.4.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 41: blk.4.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 42: blk.4.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 43: blk.4.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 46: blk.5.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 47: blk.5.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 48: blk.5.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 49: blk.5.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 50: blk.5.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 51: blk.5.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 52: blk.5.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 53: blk.5.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 54: blk.5.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 55: blk.6.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 56: blk.6.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 57: blk.6.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 58: blk.6.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 59: blk.6.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 60: blk.6.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 61: blk.6.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 62: blk.6.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 63: blk.6.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 64: blk.7.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 65: blk.7.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 66: blk.7.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 67: blk.7.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 68: blk.7.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 69: blk.7.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 70: blk.7.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 71: blk.7.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 72: blk.7.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 73: blk.8.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 74: blk.8.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 75: blk.8.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 76: blk.8.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 77: blk.8.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 78: blk.8.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 79: blk.8.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 80: blk.8.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 81: blk.8.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 82: blk.9.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 83: blk.9.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 84: blk.9.attn_v.weight q8_0 [ 5120, 
5120, 1, 1 ] llama_model_loader: - tensor 85: blk.9.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 86: blk.9.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 87: blk.9.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 88: blk.9.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 89: blk.9.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 90: blk.9.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 91: blk.10.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 92: blk.10.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 93: blk.10.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 94: blk.10.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 95: blk.10.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 96: blk.10.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 97: blk.10.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 98: blk.10.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 99: blk.10.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 100: blk.11.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 101: blk.11.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 102: blk.11.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 103: blk.11.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 104: blk.11.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 105: blk.11.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 106: blk.11.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 107: blk.11.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 108: blk.11.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 109: blk.12.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 110: blk.12.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 111: blk.12.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 112: blk.12.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 113: blk.12.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 114: blk.12.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 115: blk.12.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 116: blk.12.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 117: blk.12.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 118: blk.13.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 119: blk.13.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 120: blk.13.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 121: blk.13.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 122: blk.13.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 123: blk.13.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 124: blk.13.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 127: blk.14.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 128: 
blk.14.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 129: blk.14.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 130: blk.14.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 131: blk.14.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 132: blk.14.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 133: blk.14.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 134: blk.14.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 135: blk.14.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 136: blk.15.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 137: blk.15.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 138: blk.15.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 139: blk.15.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 140: blk.15.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 141: blk.15.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 142: blk.15.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 143: blk.15.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 144: blk.15.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 145: blk.16.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 146: blk.16.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 147: blk.16.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 148: blk.16.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 149: blk.16.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 150: blk.16.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 151: blk.16.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 152: blk.16.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 153: blk.16.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 154: blk.17.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 155: blk.17.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 156: blk.17.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 157: blk.17.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 158: blk.17.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 159: blk.17.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 160: blk.17.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 161: blk.17.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 162: blk.17.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 163: blk.18.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 164: blk.18.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 165: blk.18.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 166: blk.18.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 167: blk.18.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 168: blk.18.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 169: blk.18.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 170: blk.18.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 171: blk.18.ffn_norm.weight 
f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 172: blk.19.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 173: blk.19.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 174: blk.19.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 175: blk.19.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 176: blk.19.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 177: blk.19.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 178: blk.19.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 179: blk.19.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 180: blk.19.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 181: blk.20.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 182: blk.20.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 183: blk.20.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 184: blk.20.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 185: blk.20.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 186: blk.20.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 187: blk.20.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 189: blk.20.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 190: blk.21.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 191: blk.21.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 192: blk.21.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 193: blk.21.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 194: blk.21.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 195: blk.21.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 196: blk.21.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 197: blk.21.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 198: blk.21.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 199: blk.22.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 200: blk.22.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 201: blk.22.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 202: blk.22.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 203: blk.22.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 204: blk.22.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 205: blk.22.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 206: blk.22.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 207: blk.22.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 208: blk.23.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 209: blk.23.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 210: blk.23.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 211: blk.23.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 212: blk.23.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 213: blk.23.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 214: blk.23.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] 
llama_model_loader: - tensor 215: blk.23.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 216: blk.23.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 217: blk.24.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 218: blk.24.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 219: blk.24.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 220: blk.24.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 221: blk.24.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 222: blk.24.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 223: blk.24.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 224: blk.24.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 225: blk.24.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 226: blk.25.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 227: blk.25.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 228: blk.25.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 229: blk.25.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 230: blk.25.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 231: blk.25.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 232: blk.25.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 233: blk.25.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 234: blk.25.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 235: blk.26.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 236: blk.26.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 237: blk.26.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 238: blk.26.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 239: blk.26.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 240: blk.26.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 241: blk.26.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 242: blk.26.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 243: blk.26.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 244: blk.27.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 245: blk.27.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 246: blk.27.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 247: blk.27.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 248: blk.27.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 249: blk.27.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 250: blk.27.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 251: blk.27.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 252: blk.27.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 253: blk.28.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 254: blk.28.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 255: blk.28.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 256: blk.28.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 257: blk.28.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 
258: blk.28.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 259: blk.28.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 260: blk.28.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 261: blk.28.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 262: blk.29.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 263: blk.29.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 264: blk.29.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 265: blk.29.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 266: blk.29.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 267: blk.29.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 268: blk.29.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 271: blk.30.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 272: blk.30.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 273: blk.30.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 274: blk.30.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 275: blk.30.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 276: blk.30.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 277: blk.30.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 278: blk.30.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 279: blk.30.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 280: blk.31.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 281: blk.31.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 282: blk.31.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 283: blk.31.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 284: blk.31.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 285: blk.31.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 286: blk.31.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 287: blk.31.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 288: blk.31.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 289: blk.32.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 290: blk.32.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 291: blk.32.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 292: blk.32.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 293: blk.32.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 294: blk.32.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 295: blk.32.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 296: blk.32.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 297: blk.32.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 298: blk.33.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 299: blk.33.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 300: blk.33.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 301: blk.33.attn_output.weight 
q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 302: blk.33.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 303: blk.33.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 304: blk.33.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 305: blk.33.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 306: blk.33.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 307: blk.34.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 308: blk.34.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 309: blk.34.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 310: blk.34.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 311: blk.34.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 312: blk.34.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 313: blk.34.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 314: blk.34.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 315: blk.34.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 316: blk.35.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 317: blk.35.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 318: blk.35.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 319: blk.35.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 320: blk.35.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 321: blk.35.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 322: blk.35.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 323: blk.35.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 324: blk.35.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 325: blk.36.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 326: blk.36.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 327: blk.36.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 328: blk.36.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 329: blk.36.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 330: blk.36.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 331: blk.36.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 332: blk.36.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 333: blk.36.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 334: blk.37.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 335: blk.37.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 336: blk.37.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 337: blk.37.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 338: blk.37.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 339: blk.37.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 340: blk.37.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 341: blk.37.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 342: blk.37.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 343: blk.38.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 344: blk.38.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] 
llama_model_loader: - tensor 345: blk.38.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 346: blk.38.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 347: blk.38.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 348: blk.38.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 349: blk.38.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 350: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 351: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 352: blk.39.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 353: blk.39.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 354: blk.39.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 355: blk.39.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 356: blk.39.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 357: blk.39.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 358: blk.39.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 359: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 360: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 361: output_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 362: output.weight q8_0 [ 5120, 32002, 1, 1 ] llama_model_loader: - kv 0: general.architecture str llama_model_loader: - kv 1: general.name str llama_model_loader: - kv 2: llama.context_length u32 llama_model_loader: - kv 3: llama.embedding_length u32 llama_model_loader: - kv 4: llama.block_count u32 llama_model_loader: - kv 5: llama.feed_forward_length u32 llama_model_loader: - kv 6: llama.rope.dimension_count u32 llama_model_loader: - kv 7: llama.attention.head_count u32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 llama_model_loader: - kv 10: general.file_type u32 llama_model_loader: - kv 11: tokenizer.ggml.model str llama_model_loader: - kv 12: tokenizer.ggml.tokens arr llama_model_loader: - kv 13: tokenizer.ggml.scores arr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr llama_model_loader: - kv 15: general.quantization_version u32 llama_model_loader: - type f32: 81 tensors llama_model_loader: - type q8_0: 282 tensors llm_load_print_meta: format = GGUF V1 (support until nov 2023) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32002 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 4096 llm_load_print_meta: n_ctx = 512 llm_load_print_meta: n_embd = 5120 llm_load_print_meta: n_head = 40 llm_load_print_meta: n_head_kv = 40 llm_load_print_meta: n_layer = 40 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: f_norm_eps = 1.0e-05 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: n_ff = 13824 llm_load_print_meta: freq_base = 10000.0 llm_load_print_meta: freq_scale = 1 llm_load_print_meta: model type = 13B llm_load_print_meta: model ftype = mostly Q8_0 llm_load_print_meta: model size = 13.02 B llm_load_print_meta: general.name = LLaMA v2 llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.12 MB llm_load_tensors: using ROCm for GPU acceleration 
llm_load_tensors: mem required = 12868.56 MB (+ 400.00 MB per state) llm_load_tensors: offloading 1 repeating layers to GPU llm_load_tensors: offloaded 1/43 layers to GPU llm_load_tensors: VRAM used: 322 MB ................................................................................................... llama_new_context_with_model: kv self size = 400.00 MB llama_new_context_with_model: compute buffer total size = 75.47 MB llama_new_context_with_model: VRAM scratch buffer: 74.00 MB system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | main: interactive mode on. Input prefix: 'USER: ' Input suffix: 'ASSISTANT: ' sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0 == Running in interactive mode. == - Press Ctrl+C to interject at any time. - Press Return to return control to LLaMa. - To return control without starting a new line, end your input with '/'. - If you want to submit another line, end your input with '\'. USER: Who are you? ASSISTANT:emenкроridesraidguiaminboloriaaza Giorgfireindeeren Sint Hav Euroikai Gallasa gent Centgem elemvicunolinksinde possibilities Luxembign CSassignonte Gangampfgemmut fet Rank gem Guy Militaryaza Cowusindex Sebastraingentahagent compteda Mut Carter Tobijn doteler Gentwiewer Crit Bib Bear sic산 dotanes pystractgem Mut Junior quarunst Force dotanes experiments experiments experiment ende Piremplgemractwabidgemscluster Gren Sint Tobasa SP得 pedgemEL formulas LuxembCTYPE dotanesratwscas formulazw Mie affairs Krit Bib IP commer póamment Orientawaindexedalinks Mann Roman confl careful Grad Critazawie Index dotardingent Transfermarktrat snapunstendo/~ Primpla py mothijn Giorg Harrison parlusftizi ende Lorenzoottootto comptinaleamin Solo‰onte Sumанта experiments experiment endewertathersġazonindex census ende Domin丁gem Ratws Gam Problemguiamin trailing Grenwobidcher Kritendoued Roman Kritendo CURL Huntersonozec Luxemb Rund Orientreenirt pó flashcluster Mann dotocr ast Hudson Harrison Kritéoncock Giorg Harrisonratlassenèceottoottoottoottoottolik fetanks Roman Harrison Sum Tob conjugcip산 experimentswertialize Mann sicвенunstionaleaminculrin pyawawiestat Sint Transfermarktonagem Mann Major Tunentriesunstionaleguiaminwsendounstagit ende Dominidx pó impressunstariatstwounst elemizi Mann Luxembwieindexreeilstionalereshijnunst Fur Manncluster Romazecpandasenstionaleree sickappagem Rat Ritteradsclusterclusterclusterottoottoottoottoottoottoottoottoottoottoottoottoottoottoottoottoottolikzw Carol decisunstunounstckenentslius pyǎcraft Modeamincul gent Transfermarktonazec dotanesws Train guilty Solounstariatèceottoottoottoottoottoottoottoottoottoottoottoottoottoottoottoottoottoottoottoottoottoottolikappacluster Sint commer SumlassenmultipTLgentijn Giorg Mann Gallasa LuxembPI dissakh dotquantgemratppegensantal databijn parlus Giorg Harrisontrain pyued animalsclusterclusterclusterclusterUSER: llama_print_timings: load time = 779.90 ms llama_print_timings: sample time = 188.11 ms / 413 runs ( 0.46 ms per token, 2195.50 tokens per second) llama_print_timings: prompt eval time = 1028.05 ms / 13 tokens ( 79.08 ms per token, 12.65 tokens per 
second) llama_print_timings: eval time = 142391.22 ms / 413 runs ( 344.77 ms per token, 2.90 tokens per second) llama_print_timings: total time = 144855.66 ms ```

I tried other models and the result is the same. With -ngl 0 the output is fine. With -ngl 2, some models (like codellama) produce better results:

./main -s 0 --temp 0 -e -m ~/Downloads/codellama-13b-instruct.Q6_K.gguf -ngl 2 -p "[INST] Make a python function to load PDF files. Use the nltk library to split it into paragraphs. [/INST]"

Output ``` Log start main: build = 1152 (8b56b4f) main: seed = 0 ggml_init_cublas: found 1 ROCm devices: Device 0: AMD Radeon Graphics, compute capability 10.3 llama_model_loader: loaded meta data with 17 key-value pairs and 363 tensors from /home/kiv/Downloads/codellama-13b-instruct.Q6_K.gguf (version GGUF V1 (support until nov 2023)) llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 5120, 32016, 1, 1 ] ... llama_model_loader: - tensor 362: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - kv 0: general.architecture str llama_model_loader: - kv 1: general.name str llama_model_loader: - kv 2: llama.context_length u32 llama_model_loader: - kv 3: llama.embedding_length u32 llama_model_loader: - kv 4: llama.block_count u32 llama_model_loader: - kv 5: llama.feed_forward_length u32 llama_model_loader: - kv 6: llama.rope.dimension_count u32 llama_model_loader: - kv 7: llama.attention.head_count u32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 llama_model_loader: - kv 10: llama.rope.freq_base f32 llama_model_loader: - kv 11: general.file_type u32 llama_model_loader: - kv 12: tokenizer.ggml.model str llama_model_loader: - kv 13: tokenizer.ggml.tokens arr llama_model_loader: - kv 14: tokenizer.ggml.scores arr llama_model_loader: - kv 15: tokenizer.ggml.token_type arr llama_model_loader: - kv 16: general.quantization_version u32 llama_model_loader: - type f32: 81 tensors llama_model_loader: - type f16: 1 tensors llama_model_loader: - type q4_0: 1 tensors llama_model_loader: - type q6_K: 280 tensors llm_load_print_meta: format = GGUF V1 (support until nov 2023) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32016 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 16384 llm_load_print_meta: n_ctx = 512 llm_load_print_meta: n_embd = 5120 llm_load_print_meta: n_head = 40 llm_load_print_meta: n_head_kv = 40 llm_load_print_meta: n_layer = 40 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: f_norm_eps = 1.0e-05 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: n_ff = 13824 llm_load_print_meta: freq_base = 1000000.0 llm_load_print_meta: freq_scale = 1 llm_load_print_meta: model type = 13B llm_load_print_meta: model ftype = mostly Q6_K llm_load_print_meta: model size = 13.02 B llm_load_print_meta: general.name = LLaMA llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.12 MB llm_load_tensors: using ROCm for GPU acceleration llm_load_tensors: mem required = 9831.70 MB (+ 400.00 MB per state) llm_load_tensors: offloading 2 repeating layers to GPU llm_load_tensors: offloaded 2/43 layers to GPU llm_load_tensors: VRAM used: 497 MB .................................................................................................. 
llama_new_context_with_model: kv self size = 400.00 MB llama_new_context_with_model: compute buffer total size = 75.47 MB llama_new_context_with_model: VRAM scratch buffer: 74.00 MB system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.000000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000 generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0 [INST] Make a python function to load PDF files. Use the nltk library to split it into paragraphs. [/INST]ucceeded in loading the PDFREATED BY: [INST: ] import nltk nltk.download('punkt') def load_pdf(filepath): with open(filepath, 'rb') as f: pdf = PyPDF2.PdfFileReader(f) text = '' for page in range(pdf.getNumPages()): page_text = pdf.getfanPage(quencehren) text += page_text 華文化的內容。 return text def split_paragraphs(text): sentences = nltklus.sent_bertokens(text) paragraphs = [] current_paragraph = ''rys.append(current_paragraph) current_paragraph = sentence return paragraphs if __name__ == '__main__': filepath =chen.pdf' Ogden, Utah text = load_pdf(filepath) paragraphs = split_paragraphs(text) for paragraph incompatibility: print(paragraph) [end of text] llama_print_timings: load time = 735.26 ms llama_print_timings: sample time = 119.17 ms / 247 runs ( 0.48 ms per token, 2072.76 tokens per second) llama_print_timings: prompt eval time = 2411.01 ms / 30 tokens ( 80.37 ms per token, 12.44 tokens per second) llama_print_timings: eval time = 75539.01 ms / 246 runs ( 307.07 ms per token, 3.26 tokens per second) llama_print_timings: total time = 78120.95 ms Log end ```

With -nommq I get complete garbage even with -ngl 4. Larger values can't be checked because for some reason I get ggml-cuda.cu:5048: out of memory, even though I have 32GB of RAM. I thought memory for VRAM was allocated dynamically, but llama.cpp always reports 2GB.
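
For what it's worth, the 2GB llama.cpp reports matches the 2097152 KB coarse-grained pool in the rocminfo dump above, i.e. the fixed UMA carve-out for the iGPU rather than memory grown on demand from the 32GB of system RAM. A rough way to inspect the pools, assuming rocm-smi from the same ROCm 5.6 install, might be:

```
# Sketch: inspect the iGPU memory pools (assumes rocm-smi under /opt/rocm;
# output format and APU support may vary between ROCm releases).
/opt/rocm/bin/rocm-smi --showmeminfo vram   # the fixed carve-out (~2 GB here)
/opt/rocm/bin/rocm-smi --showmeminfo gtt    # system RAM reachable via GTT
```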

JohannesGaessler commented 1 year ago

I can't reproduce the issue on my RX 6800. Which GPU are you using? I don't think this information was in the rocminfo output.

Jipok commented 1 year ago

The AMD Ryzen 7 6800H has integrated Radeon 680M graphics.

Jipok commented 1 year ago

Should the output be exactly the same for different -ngl X?
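
A minimal way to compare the different -ngl levels side by side could be the loop below, reusing the model path and prompt from the commands above (the -ngl values and the -n 32 cap are arbitrary). Exact token-for-token equality isn't guaranteed even on a healthy setup, since CPU and GPU kernels accumulate floating point results in different orders, but the text should stay coherent at every level.

```
# Sketch: run the same greedy, fixed-seed prompt at several offload levels
# and eyeball the outputs for garbling. Loading logs go to stderr, so drop them.
for ngl in 0 1 2 4; do
  echo "=== -ngl $ngl ==="
  ./main -s 0 --temp 0 -ngl "$ngl" \
    -m ~/Downloads/codellama-13b-instruct.Q6_K.gguf \
    -p "[INST] 1+1= [/INST]" -n 32 2>/dev/null
done
```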

Jipok commented 1 year ago

I also have a problem with the Docker version:

podman run --rm -it --init --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 -v ~/Downloads:/models llama.cpp:rocm main -s 0 -ngl 1 -m /models/codellama-13b-instruct.Q6_K.gguf -p "[INST] 1+1= [/INST]"

Full log ``` Log start main: build = 0 (unknown) main: seed = 0 ggml_init_cublas: found 1 ROCm devices: Device 0: AMD Radeon Graphics, compute capability 10.3 llama_model_loader: loaded meta data with 17 key-value pairs and 363 tensors from /models/codellama-13b-instruct.Q6_K.gguf (version GGUF V1 (support until nov 2023)) llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 5120, 32016, 1, 1 ] llama_model_loader: - tensor 1: output_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 2: output.weight f16 [ 5120, 32016, 1, 1 ] llama_model_loader: - tensor 3: blk.0.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 4: blk.0.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 5: blk.0.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 6: blk.0.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 7: blk.0.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 8: blk.0.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 9: blk.0.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 10: blk.0.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 11: blk.0.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 12: blk.1.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 13: blk.1.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 14: blk.1.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 15: blk.1.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 16: blk.1.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 17: blk.1.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 18: blk.1.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 19: blk.1.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 20: blk.1.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 21: blk.2.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 22: blk.2.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 23: blk.2.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 24: blk.2.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 25: blk.2.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 26: blk.2.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 27: blk.2.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 28: blk.2.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 29: blk.2.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 30: blk.3.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 31: blk.3.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 32: blk.3.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 33: blk.3.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 34: blk.3.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 35: blk.3.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 36: blk.3.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 37: blk.3.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 38: blk.3.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 39: blk.4.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 40: blk.4.attn_k.weight 
q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 41: blk.4.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 42: blk.4.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 43: blk.4.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 44: blk.4.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 45: blk.4.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 46: blk.4.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 47: blk.4.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 48: blk.5.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 49: blk.5.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 50: blk.5.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 51: blk.5.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 52: blk.5.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 53: blk.5.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 54: blk.5.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 55: blk.5.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 56: blk.5.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 57: blk.6.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 58: blk.6.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 59: blk.6.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 60: blk.6.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 61: blk.6.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 62: blk.6.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 63: blk.6.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 64: blk.6.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 65: blk.6.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 66: blk.7.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 67: blk.7.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 68: blk.7.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 69: blk.7.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 70: blk.7.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 71: blk.7.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 72: blk.7.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 73: blk.7.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 74: blk.7.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 75: blk.8.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 76: blk.8.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 77: blk.8.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 78: blk.8.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 79: blk.8.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 80: blk.8.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 81: blk.8.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 82: blk.8.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 83: blk.8.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 84: blk.9.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] 
llama_model_loader: - tensor 85: blk.9.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 86: blk.9.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 87: blk.9.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 88: blk.9.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 89: blk.9.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 90: blk.9.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 91: blk.9.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 92: blk.9.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 93: blk.10.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 94: blk.10.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 95: blk.10.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 96: blk.10.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 97: blk.10.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 98: blk.10.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 99: blk.10.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 100: blk.10.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 101: blk.10.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 102: blk.11.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 103: blk.11.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 104: blk.11.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 105: blk.11.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 106: blk.11.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 107: blk.11.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 108: blk.11.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 109: blk.11.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 110: blk.11.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 111: blk.12.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 112: blk.12.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 113: blk.12.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 114: blk.12.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 115: blk.12.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 116: blk.12.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 117: blk.12.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 118: blk.12.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 119: blk.12.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 120: blk.13.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 121: blk.13.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 122: blk.13.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 123: blk.13.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 124: blk.13.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 125: blk.13.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 126: blk.13.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 127: blk.13.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 128: 
blk.13.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 129: blk.14.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 130: blk.14.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 131: blk.14.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 132: blk.14.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 133: blk.14.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 134: blk.14.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 135: blk.14.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 136: blk.14.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 137: blk.14.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 138: blk.15.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 139: blk.15.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 140: blk.15.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 141: blk.15.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 142: blk.15.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 143: blk.15.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 144: blk.15.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 145: blk.15.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 146: blk.15.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 147: blk.16.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 148: blk.16.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 149: blk.16.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 150: blk.16.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 151: blk.16.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 152: blk.16.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 153: blk.16.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 154: blk.16.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 155: blk.16.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 156: blk.17.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 157: blk.17.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 158: blk.17.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 159: blk.17.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 160: blk.17.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 161: blk.17.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 162: blk.17.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 163: blk.17.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 164: blk.17.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 165: blk.18.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 166: blk.18.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 167: blk.18.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 168: blk.18.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 169: blk.18.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 170: blk.18.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 171: blk.18.ffn_up.weight q6_K 
[ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 172: blk.18.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 173: blk.18.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 174: blk.19.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 175: blk.19.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 176: blk.19.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 177: blk.19.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 178: blk.19.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 179: blk.19.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 180: blk.19.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 181: blk.19.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 182: blk.19.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 183: blk.20.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 184: blk.20.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 185: blk.20.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 186: blk.20.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 187: blk.20.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 188: blk.20.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 189: blk.20.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 190: blk.20.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 191: blk.20.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 192: blk.21.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 193: blk.21.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 194: blk.21.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 195: blk.21.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 196: blk.21.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 197: blk.21.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 198: blk.21.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 199: blk.21.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 200: blk.21.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 201: blk.22.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 202: blk.22.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 203: blk.22.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 204: blk.22.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 205: blk.22.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 206: blk.22.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 207: blk.22.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 208: blk.22.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 209: blk.22.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 210: blk.23.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 211: blk.23.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 212: blk.23.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 213: blk.23.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 214: blk.23.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] 
llama_model_loader: - tensor 215: blk.23.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 216: blk.23.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 217: blk.23.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 218: blk.23.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 219: blk.24.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 220: blk.24.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 221: blk.24.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 222: blk.24.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 223: blk.24.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 224: blk.24.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 225: blk.24.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 226: blk.24.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 227: blk.24.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 228: blk.25.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 229: blk.25.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 230: blk.25.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 231: blk.25.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 232: blk.25.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 233: blk.25.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 234: blk.25.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 235: blk.25.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 236: blk.25.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 237: blk.26.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 238: blk.26.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 239: blk.26.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 240: blk.26.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 241: blk.26.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 242: blk.26.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 243: blk.26.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 244: blk.26.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 245: blk.26.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 246: blk.27.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 247: blk.27.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 248: blk.27.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 249: blk.27.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 250: blk.27.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 251: blk.27.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 252: blk.27.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 253: blk.27.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 254: blk.27.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 255: blk.28.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 256: blk.28.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 257: blk.28.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 258: 
blk.28.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 259: blk.28.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 260: blk.28.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 261: blk.28.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 262: blk.28.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 263: blk.28.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 264: blk.29.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 265: blk.29.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 266: blk.29.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 267: blk.29.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 268: blk.29.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 269: blk.29.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 270: blk.29.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 271: blk.29.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 272: blk.29.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 273: blk.30.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 274: blk.30.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 275: blk.30.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 276: blk.30.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 277: blk.30.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 278: blk.30.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 279: blk.30.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 280: blk.30.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 281: blk.30.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 282: blk.31.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 283: blk.31.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 284: blk.31.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 285: blk.31.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 286: blk.31.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 287: blk.31.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 288: blk.31.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 289: blk.31.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 290: blk.31.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 291: blk.32.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 292: blk.32.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 293: blk.32.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 294: blk.32.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 295: blk.32.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 296: blk.32.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 297: blk.32.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 298: blk.32.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 299: blk.32.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 300: blk.33.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 301: blk.33.attn_k.weight q6_K 
[ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 302: blk.33.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 303: blk.33.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 304: blk.33.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 305: blk.33.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 306: blk.33.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 307: blk.33.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 308: blk.33.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 309: blk.34.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 310: blk.34.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 311: blk.34.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 312: blk.34.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 313: blk.34.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 314: blk.34.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 315: blk.34.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 316: blk.34.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 317: blk.34.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 318: blk.35.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 319: blk.35.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 320: blk.35.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 321: blk.35.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 322: blk.35.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 323: blk.35.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 324: blk.35.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 325: blk.35.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 326: blk.35.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 327: blk.36.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 328: blk.36.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 329: blk.36.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 330: blk.36.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 331: blk.36.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 332: blk.36.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 333: blk.36.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 334: blk.36.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 335: blk.36.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 336: blk.37.attn_q.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 337: blk.37.attn_k.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 338: blk.37.attn_v.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 339: blk.37.attn_output.weight q6_K [ 5120, 5120, 1, 1 ] llama_model_loader: - tensor 340: blk.37.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 341: blk.37.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ] llama_model_loader: - tensor 342: blk.37.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ] llama_model_loader: - tensor 343: blk.37.attn_norm.weight f32 [ 5120, 1, 1, 1 ] llama_model_loader: - tensor 344: blk.37.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] 
llama_model_loader: - tensor 345: blk.38.attn_q.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 346: blk.38.attn_k.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 347: blk.38.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 348: blk.38.attn_output.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 349: blk.38.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 350: blk.38.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 351: blk.38.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 352: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 353: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 354: blk.39.attn_q.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 355: blk.39.attn_k.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 356: blk.39.attn_v.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 357: blk.39.attn_output.weight q6_K [ 5120, 5120, 1, 1 ]
llama_model_loader: - tensor 358: blk.39.ffn_gate.weight q6_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 359: blk.39.ffn_down.weight q6_K [ 13824, 5120, 1, 1 ]
llama_model_loader: - tensor 360: blk.39.ffn_up.weight q6_K [ 5120, 13824, 1, 1 ]
llama_model_loader: - tensor 361: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - tensor 362: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: general.quantization_version u32
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q4_0: 1 tensors
llama_model_loader: - type q6_K: 280 tensors
llm_load_print_meta: format = GGUF V1 (support until nov 2023)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32016
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_ctx = 512
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: freq_base = 1000000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q6_K
llm_load_print_meta: model size = 13.02 B
llm_load_print_meta: general.name = LLaMA
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required = 10079.89 MB (+ 400.00 MB per state)
llm_load_tensors: offloading 1 repeating layers to GPU
llm_load_tensors: offloaded 1/43 layers to GPU
llm_load_tensors: VRAM used: 249 MB
..................................................................................................
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 75.47 MB
llama_new_context_with_model: VRAM scratch buffer: 74.00 MB
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
```

The same command works fine with -ngl 0.

JohannesGaessler commented 1 year ago

> Should the output be exactly the same for different -ngl X?

No, due to differences in rounding error you cannot expect bit-for-bit identical results if you vary the number of GPU layers.
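
One way to see how much divergence is expected (a sketch only; the binary name, model path, prompt, and seed below are placeholders): generate with a fixed seed at two offload levels and diff the results. Small token-level drift after a while is normal rounding behaviour; output that collapses into garbage, as reported here, is not.

```
# Sketch only: model path, prompt and seed are placeholders.
# Compare the same fixed-seed run at two different -ngl values.
./main -m ./models/model-q6_K.gguf -s 1 -n 64 -p "Hello" -ngl 0 > out_ngl0.txt 2>/dev/null
./main -m ./models/model-q6_K.gguf -s 1 -n 64 -p "Hello" -ngl 8 > out_ngl8.txt 2>/dev/null
diff out_ngl0.txt out_ngl8.txt   # some drift is expected; garbage is not
```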

Jipok commented 1 year ago

Is this a graphics card support issue in the ROCm code?

JohannesGaessler commented 1 year ago

I don't know. The hardware you're using is, to my knowledge, not supported by ROCm, and at the same time I did not implement the CUDA code with integrated graphics in mind.
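
For context, gfx1035 is an RDNA2 iGPU, and the workaround commonly reported for unsupported RDNA2 parts is to present the device to ROCm as gfx1030 via HSA_OVERRIDE_GFX_VERSION (the same override used in the Docker command further down). A minimal sketch, assuming a build that contains gfx1030 code objects and a placeholder model path; whether the results then come out numerically correct on this APU is exactly what this issue is about.

```
# Commonly reported workaround, not official support: present the gfx1035
# iGPU to ROCm as gfx1030. Assumes the build contains gfx1030 kernels.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
./main -m ./models/model-q6_K.gguf -p "Hello" -ngl 10
```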

takov751 commented 1 year ago

I've done exactly the same thing as above: built from the latest commit cf9b08485c4c2d4d945c6e74fe20f273a38b6104 and used the same Docker setup as above. GPU: RX 6600.

```
docker run --rm -it --init --device /dev/dri --device /dev/kfd -e HSA_OVERRIDE_GFX_VERSION=10.3.0 \
  -v $PWD:/models llama-cpp:rocm -m /models/openbuddy-llama2-13b-v11.1.Q4_K_M.gguf -s 0 -mg 0 \
  --interactive-first -ngl 30
```

With -ngl 40 I would run out of VRAM; 30 seemed stable.
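
As a side note, the VRAM headroom for a given -ngl can be watched from a second terminal; a sketch assuming rocm-smi is available under the usual /opt/rocm/bin path.

```
# Watch VRAM usage while experimenting with -ngl (assumes a standard ROCm install).
watch -n 1 /opt/rocm/bin/rocm-smi --showmeminfo vram
```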

However, the output was:

```
lm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 37632
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 512
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model size = 13.07 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: mem required = 2099.27 MB (+ 400.00 MB per state)
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloaded 30/43 layers to GPU
llm_load_tensors: VRAM used: 5440 MB
...................................................................................................
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 84.97 MB
llama_new_context_with_model: VRAM scratch buffer: 83.50 MB
CUDA error 98 at ggml-cuda.cu:6063: invalid device function
```

Specifically, this line concerns me the most:

```
CUDA error 98 at ggml-cuda.cu:6063: invalid device function
```
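
Under HIP this error usually means the loaded binary contains no kernel code objects for the GPU's architecture (an RX 6600 is gfx1032, which the HSA_OVERRIDE_GFX_VERSION=10.3.0 override maps onto gfx1030). A hedged sketch of a rebuild that targets gfx1030 explicitly, following the ROCm build instructions of that period; the exact flag names may differ in your checkout.

```
# Sketch only: rebuild the ROCm backend for gfx1030 so the overridden gfx1032
# card finds matching kernels. Verify flag names against your checkout's README.
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
  cmake -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```
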
JohannesGaessler commented 1 year ago

This is clearly a different problem altogether. Please make a separate issue.

takov751 commented 1 year ago

> This is clearly a different problem altogether. Please make a separate issue.

Fair point, I will do so when I am back home.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.