ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Cannot load Bloom-7b1 ggml model in GPU #3697

Closed zolastro closed 12 months ago

zolastro commented 1 year ago

I used the convert-bloom-hf-to-gguf.py script to convert the Hugging Face bigscience/bloom-7b1 model to GGUF with f16 precision, which succeeded:

python convert-bloom-hf-to-gguf.py models/bloom-7b1/ 1

This produces a model, ggml-model-f16.gguf, that loads and runs correctly on CPU. However, when I try to offload layers to the GPU, I get the following error:

GGML_ASSERT: /llama.cpp/ggml-cuda.cu:6115: false

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      46 bits physical, 48 bits virtual
CPU(s):                             64
On-line CPU(s) list:                0-63
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          2
NUMA node(s):                       2
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              79
Model name:                         Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
Stepping:                           1
CPU MHz:                            1197.469
CPU max MHz:                        3000.0000
CPU min MHz:                        1200.0000
BogoMIPS:                           4190.27
Virtualization:                     VT-x
L1d cache:                          1 MiB
L1i cache:                          1 MiB
L2 cache:                           8 MiB
L3 cache:                           80 MiB
NUMA node0 CPU(s):                  0-15,32-47
NUMA node1 CPU(s):                  16-31,48-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: Split huge pages
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; Clear CPU buffers; SMT vulnerable
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm
                                    constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
                                    sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault
                                    epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle
                                    avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln
                                    pts md_clear flush_l1d

Linux nemo 5.4.0-165-generic #182-Ubuntu SMP Mon Oct 2 19:43:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Python 3.10.13
GNU Make 4.2.1
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Clone bloom-7b1 from Hugging Face (https://huggingface.co/bigscience/bloom-7b1)
  2. Use convert-bloom-hf-to-gguf.py to convert it to an f16 GGUF.
  3. Try to load the model on GPU:
    ./build/bin/main -m models/bloom-7b1/ggml-model-f16.gguf -n 256 -b 512 -c 512  -f ../prompt.txt --threads 32 --temp 0.1 --top-p 0.75 --top-k 40 -cb -ngl 33

Failure Logs

    ./build/bin/main -m models/bloom-7b1/ggml-model-f16.gguf -n 256 -b 512 -c 512  -f ../prompt.txt --threads 32 --temp 0.1 --top-p 0.75 --top-k 40 -cb -ngl 33
    Log start
    main: build = 1399 (004797f)
    main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
    main: seed  = 1697803662
    ggml_init_cublas: found 6 CUDA devices:
    Device 0: NVIDIA TITAN RTX, compute capability 7.5
    Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5
    Device 2: NVIDIA TITAN Xp, compute capability 6.1
    Device 3: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
    Device 4: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
    Device 5: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
    llama_model_loader: loaded meta data with 19 key-value pairs and 366 tensors from models/bloom-7b1/ggml-model-f16.gguf (version GGUF V2 (latest))
    llama_model_loader: - tensor    0:                token_embd.weight f16      [  4096, 250880,     1,     1 ]
    llama_model_loader: - tensor    1:                    output.weight f16      [  4096, 250880,     1,     1 ]
    llama_model_loader: - tensor    2:           token_embd_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor    3:             token_embd_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor    4:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor    5:             blk.0.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor    6:            blk.0.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor    7:              blk.0.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor    8:         blk.0.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor    9:           blk.0.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   10:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   11:              blk.0.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   12:              blk.0.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor   13:                blk.0.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor   14:            blk.0.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor   15:              blk.0.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   16:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   17:             blk.1.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   18:            blk.1.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor   19:              blk.1.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor   20:         blk.1.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor   21:           blk.1.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   22:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   23:              blk.1.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   24:              blk.1.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor   25:                blk.1.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor   26:            blk.1.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor   27:              blk.1.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   28:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   29:             blk.2.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   30:            blk.2.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor   31:              blk.2.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor   32:         blk.2.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor   33:           blk.2.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   34:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   35:              blk.2.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   36:              blk.2.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor   37:                blk.2.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor   38:            blk.2.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor   39:              blk.2.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   40:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   41:             blk.3.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   42:            blk.3.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor   43:              blk.3.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor   44:         blk.3.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor   45:           blk.3.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   46:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   47:              blk.3.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   48:              blk.3.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor   49:                blk.3.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor   50:            blk.3.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor   51:              blk.3.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   52:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   53:             blk.4.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   54:            blk.4.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor   55:              blk.4.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor   56:         blk.4.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor   57:           blk.4.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   58:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   59:              blk.4.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   60:              blk.4.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor   61:                blk.4.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor   62:            blk.4.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor   63:              blk.4.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   64:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   65:             blk.5.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   66:            blk.5.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor   67:              blk.5.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor   68:         blk.5.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor   69:           blk.5.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   70:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   71:              blk.5.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   72:              blk.5.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor   73:                blk.5.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor   74:            blk.5.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor   75:              blk.5.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   76:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   77:             blk.6.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   78:            blk.6.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor   79:              blk.6.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor   80:         blk.6.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor   81:           blk.6.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   82:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   83:              blk.6.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   84:              blk.6.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor   85:                blk.6.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor   86:            blk.6.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor   87:              blk.6.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   88:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   89:             blk.7.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   90:            blk.7.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor   91:              blk.7.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor   92:         blk.7.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor   93:           blk.7.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   94:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   95:              blk.7.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor   96:              blk.7.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor   97:                blk.7.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor   98:            blk.7.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor   99:              blk.7.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  100:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  101:             blk.8.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  102:            blk.8.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  103:              blk.8.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  104:         blk.8.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  105:           blk.8.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  106:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  107:              blk.8.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  108:              blk.8.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  109:                blk.8.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  110:            blk.8.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  111:              blk.8.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  112:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  113:             blk.9.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  114:            blk.9.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  115:              blk.9.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  116:         blk.9.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  117:           blk.9.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  118:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  119:              blk.9.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  120:              blk.9.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  121:                blk.9.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  122:            blk.9.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  123:              blk.9.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  124:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  125:            blk.10.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  126:           blk.10.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  127:             blk.10.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  128:        blk.10.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  129:          blk.10.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  130:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  131:             blk.10.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  132:             blk.10.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  133:               blk.10.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  134:           blk.10.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  135:             blk.10.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  136:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  137:            blk.11.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  138:           blk.11.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  139:             blk.11.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  140:        blk.11.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  141:          blk.11.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  142:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  143:             blk.11.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  144:             blk.11.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  145:               blk.11.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  146:           blk.11.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  147:             blk.11.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  148:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  149:            blk.12.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  150:           blk.12.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  151:             blk.12.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  152:        blk.12.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  153:          blk.12.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  154:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  155:             blk.12.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  156:             blk.12.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  157:               blk.12.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  158:           blk.12.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  159:             blk.12.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  160:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  161:            blk.13.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  162:           blk.13.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  163:             blk.13.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  164:        blk.13.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  165:          blk.13.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  166:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  167:             blk.13.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  168:             blk.13.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  169:               blk.13.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  170:           blk.13.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  171:             blk.13.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  172:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  173:            blk.14.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  174:           blk.14.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  175:             blk.14.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  176:        blk.14.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  177:          blk.14.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  178:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  179:             blk.14.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  180:             blk.14.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  181:               blk.14.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  182:           blk.14.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  183:             blk.14.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  184:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  185:            blk.15.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  186:           blk.15.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  187:             blk.15.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  188:        blk.15.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  189:          blk.15.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  190:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  191:             blk.15.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  192:             blk.15.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  193:               blk.15.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  194:           blk.15.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  195:             blk.15.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  196:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  197:            blk.16.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  198:           blk.16.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  199:             blk.16.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  200:        blk.16.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  201:          blk.16.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  202:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  203:             blk.16.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  204:             blk.16.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  205:               blk.16.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  206:           blk.16.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  207:             blk.16.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  208:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  209:            blk.17.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  210:           blk.17.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  211:             blk.17.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  212:        blk.17.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  213:          blk.17.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  214:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  215:             blk.17.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  216:             blk.17.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  217:               blk.17.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  218:           blk.17.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  219:             blk.17.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  220:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  221:            blk.18.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  222:           blk.18.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  223:             blk.18.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  224:        blk.18.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  225:          blk.18.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  226:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  227:             blk.18.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  228:             blk.18.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  229:               blk.18.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  230:           blk.18.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  231:             blk.18.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  232:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  233:            blk.19.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  234:           blk.19.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  235:             blk.19.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  236:        blk.19.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  237:          blk.19.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  238:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  239:             blk.19.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  240:             blk.19.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  241:               blk.19.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  242:           blk.19.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  243:             blk.19.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  244:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  245:            blk.20.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  246:           blk.20.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  247:             blk.20.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  248:        blk.20.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  249:          blk.20.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  250:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  251:             blk.20.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  252:             blk.20.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  253:               blk.20.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  254:           blk.20.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  255:             blk.20.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  256:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  257:            blk.21.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  258:           blk.21.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  259:             blk.21.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  260:        blk.21.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  261:          blk.21.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  262:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  263:             blk.21.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  264:             blk.21.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  265:               blk.21.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  266:           blk.21.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  267:             blk.21.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  268:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  269:            blk.22.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  270:           blk.22.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  271:             blk.22.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  272:        blk.22.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  273:          blk.22.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  274:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  275:             blk.22.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  276:             blk.22.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  277:               blk.22.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  278:           blk.22.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  279:             blk.22.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  280:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  281:            blk.23.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  282:           blk.23.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  283:             blk.23.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  284:        blk.23.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  285:          blk.23.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  286:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  287:             blk.23.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  288:             blk.23.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  289:               blk.23.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  290:           blk.23.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  291:             blk.23.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  292:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  293:            blk.24.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  294:           blk.24.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  295:             blk.24.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  296:        blk.24.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  297:          blk.24.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  298:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  299:             blk.24.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  300:             blk.24.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  301:               blk.24.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  302:           blk.24.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  303:             blk.24.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  304:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  305:            blk.25.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  306:           blk.25.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  307:             blk.25.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  308:        blk.25.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  309:          blk.25.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  310:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  311:             blk.25.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  312:             blk.25.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  313:               blk.25.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  314:           blk.25.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  315:             blk.25.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  316:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  317:            blk.26.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  318:           blk.26.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  319:             blk.26.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  320:        blk.26.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  321:          blk.26.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  322:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  323:             blk.26.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  324:             blk.26.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  325:               blk.26.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  326:           blk.26.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  327:             blk.26.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  328:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  329:            blk.27.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  330:           blk.27.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  331:             blk.27.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  332:        blk.27.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  333:          blk.27.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  334:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  335:             blk.27.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  336:             blk.27.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  337:               blk.27.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  338:           blk.27.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  339:             blk.27.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  340:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  341:            blk.28.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  342:           blk.28.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  343:             blk.28.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  344:        blk.28.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  345:          blk.28.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  346:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  347:             blk.28.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  348:             blk.28.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  349:               blk.28.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  350:           blk.28.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  351:             blk.28.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  352:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  353:            blk.29.attn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  354:           blk.29.attn_qkv.weight f16      [  4096, 12288,     1,     1 ]
    llama_model_loader: - tensor  355:             blk.29.attn_qkv.bias f32      [ 12288,     1,     1,     1 ]
    llama_model_loader: - tensor  356:        blk.29.attn_output.weight f16      [  4096,  4096,     1,     1 ]
    llama_model_loader: - tensor  357:          blk.29.attn_output.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  358:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  359:             blk.29.ffn_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  360:             blk.29.ffn_up.weight f16      [  4096, 16384,     1,     1 ]
    llama_model_loader: - tensor  361:               blk.29.ffn_up.bias f32      [ 16384,     1,     1,     1 ]
    llama_model_loader: - tensor  362:           blk.29.ffn_down.weight f16      [ 16384,  4096,     1,     1 ]
    llama_model_loader: - tensor  363:             blk.29.ffn_down.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  364:               output_norm.weight f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - tensor  365:                 output_norm.bias f32      [  4096,     1,     1,     1 ]
    llama_model_loader: - kv   0:                       general.architecture str     
    llama_model_loader: - kv   1:                               general.name str     
    llama_model_loader: - kv   2:                       bloom.context_length u32     
    llama_model_loader: - kv   3:                     bloom.embedding_length u32     
    llama_model_loader: - kv   4:                  bloom.feed_forward_length u32     
    llama_model_loader: - kv   5:                          bloom.block_count u32     
    llama_model_loader: - kv   6:                 bloom.attention.head_count u32     
    llama_model_loader: - kv   7:              bloom.attention.head_count_kv u32     
    llama_model_loader: - kv   8:         bloom.attention.layer_norm_epsilon f32     
    llama_model_loader: - kv   9:                          general.file_type u32     
    llama_model_loader: - kv  10:                       tokenizer.ggml.model str     
    llama_model_loader: - kv  11:                      tokenizer.ggml.tokens arr     
    llama_model_loader: - kv  12:                      tokenizer.ggml.scores arr     
    llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr     
    llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr     
    llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32     
    llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
    llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32     
    llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32     
    llama_model_loader: - type  f32:  244 tensors
    llama_model_loader: - type  f16:  122 tensors
    llm_load_vocab: mismatch in special tokens definition ( 203/250880 vs 0/250880 ).
    llm_load_print_meta: format           = GGUF V2 (latest)
    llm_load_print_meta: arch             = bloom
    llm_load_print_meta: vocab type       = BPE
    llm_load_print_meta: n_vocab          = 250880
    llm_load_print_meta: n_merges         = 250434
    llm_load_print_meta: n_ctx_train      = 4096
    llm_load_print_meta: n_embd           = 4096
    llm_load_print_meta: n_head           = 32
    llm_load_print_meta: n_head_kv        = 32
    llm_load_print_meta: n_layer          = 30
    llm_load_print_meta: n_rot            = 128
    llm_load_print_meta: n_gqa            = 1
    llm_load_print_meta: f_norm_eps       = 1.0e-05
    llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
    llm_load_print_meta: f_clamp_kqv      = 0.0e+00
    llm_load_print_meta: f_max_alibi_bias = 0.0e+00
    llm_load_print_meta: n_ff             = 16384
    llm_load_print_meta: freq_base_train  = 10000.0
    llm_load_print_meta: freq_scale_train = 1
    llm_load_print_meta: model type       = 7B
    llm_load_print_meta: model ftype      = mostly F16
    llm_load_print_meta: model params     = 8.10 B
    llm_load_print_meta: model size       = 15.08 GiB (16.00 BPW) 
    llm_load_print_meta: general.name   = Bloom
    llm_load_print_meta: BOS token = 1 '<s>'
    llm_load_print_meta: EOS token = 2 '</s>'
    llm_load_print_meta: UNK token = 0 '<unk>'
    llm_load_print_meta: PAD token = 3 '<pad>'
    llm_load_print_meta: LF token  = 130 'Ä'
    llm_load_tensors: ggml ctx size =    0.12 MB
    llm_load_tensors: using CUDA for GPU acceleration
    ggml_cuda_set_main_device: using device 0 (NVIDIA TITAN RTX) as main device
    llm_load_tensors: mem required  = 1960.15 MB
    llm_load_tensors: offloading 30 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 33/33 layers to GPU
    llm_load_tensors: VRAM used: 13486.12 MB
    ...GGML_ASSERT: /llama.cpp/ggml-cuda.cu:6115: false
    Aborted (core dumped)
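For what it's worth, the reported VRAM figure is consistent with the tensor listing, so the load itself looks correct and the failure is in the CUDA op, not the conversion. A back-of-the-envelope sketch (my own arithmetic from the logged shapes, not llama.cpp's internal accounting) reproduces the 13486.12 MB:

```python
# Rough VRAM estimate for bloom-7b1 at f16, derived from the tensor
# shapes in the loader log above (my own arithmetic, not llama.cpp's).
n_layer = 30      # llm_load_print_meta: n_layer
n_embd  = 4096    # llm_load_print_meta: n_embd
n_ff    = 16384   # llm_load_print_meta: n_ff
n_vocab = 250880  # llm_load_print_meta: n_vocab

# f16 weight matrices per repeating layer (2 bytes/element)
f16_bytes = 2 * (n_embd * 3 * n_embd   # attn_qkv.weight [4096, 12288]
                 + n_embd * n_embd     # attn_output.weight
                 + n_embd * n_ff       # ffn_up.weight
                 + n_ff * n_embd)      # ffn_down.weight

# f32 norms and biases per repeating layer (4 bytes/element)
f32_bytes = 4 * (6 * n_embd            # attn/ffn norms (w+b), attn_output.bias, ffn_down.bias
                 + 3 * n_embd          # attn_qkv.bias [12288]
                 + n_ff)               # ffn_up.bias [16384]

# non-repeating tensors, offloaded since all 33/33 layers went to the GPU
output_bytes = 2 * n_vocab * n_embd    # output matrix, f16
norm_bytes   = 4 * 2 * n_embd          # output_norm.weight + .bias, f32

total_mb = (n_layer * (f16_bytes + f32_bytes) + output_bytes + norm_bytes) / 2**20
print(f"{total_mb:.2f} MB")  # matches "VRAM used: 13486.12 MB"
```

The leftover 1960.15 MB of host memory is exactly the f16 token_embd matrix (250880 × 4096 × 2 bytes), which stays on the CPU.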
QueryType commented 1 year ago

I am not able to load the llava model (ggml-model-q4_k.gguf, with mmproj mmproj-model-f16.gguf) into the GPU either; (main) llama v2 works fine.

    llama.cpp % ./run_llava.sh /Volumes/d/shm/output/2023-10-20/00000-3438987449-swapped.png "describe the image"
    clip_model_load: model name: openai/clip-vit-large-patch14-336
    clip_model_load: description: image encoder for LLaVA
    clip_model_load: GGUF version: 2
    clip_model_load: alignment: 32
    clip_model_load: n_tensors: 377
    clip_model_load: n_kv: 18
    clip_model_load: ftype: f16
    clip_model_load: text_encoder: 0
    clip_model_load: vision_encoder: 1
    clip_model_load: llava_projector: 1
    clip_model_load: model size: 595.61 MB
    clip_model_load: metadata size: 0.13 MB
    clip_model_load: total allocated memory: 201.27 MB
    llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /Volumes/d/apps/aimodels/llama2/hf_models/llava/ggml-model-q4_k.gguf (version GGUF V2 (latest))
    llama_model_loader: - tensor 0: token_embd.weight q4_K [ 4096, 32000, 1, 1 ]
    llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 3: blk.0.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 4: blk.0.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 6: blk.0.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 7: blk.0.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 10: blk.1.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 11: blk.1.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 12: blk.1.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 13: blk.1.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 14: blk.1.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 15: blk.1.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 16: blk.1.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 19: blk.2.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 20: blk.2.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 21: blk.2.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 22: blk.2.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 23: blk.2.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 24: blk.2.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 25: blk.2.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 28: blk.3.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 29: blk.3.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 30: blk.3.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 31: blk.3.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 32: blk.3.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 33: blk.3.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 34: blk.3.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 37: blk.4.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 38: blk.4.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 39: blk.4.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 40: blk.4.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 41: blk.4.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 42: blk.4.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 43: blk.4.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 46: blk.5.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 47: blk.5.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 48: blk.5.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 49: blk.5.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 50: blk.5.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 51: blk.5.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 52: blk.5.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 53: blk.5.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 54: blk.5.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 55: blk.6.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 56: blk.6.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 57: blk.6.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 58: blk.6.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 59: blk.6.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 60: blk.6.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 61: blk.6.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 62: blk.6.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 63: blk.6.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 64: blk.7.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 65: blk.7.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 66: blk.7.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 67: blk.7.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 68: blk.7.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 69: blk.7.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 70: blk.7.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 71: blk.7.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 72: blk.7.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 73: blk.8.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 74: blk.8.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 75: blk.8.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 76: blk.8.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 77: blk.8.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 78: blk.8.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 79: blk.8.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 80: blk.8.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 81: blk.8.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 82: blk.9.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 83: blk.9.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 84: blk.9.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 85: blk.9.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 86: blk.9.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 87: blk.9.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 88: blk.9.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 89: blk.9.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 90: blk.9.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 91: blk.10.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 92: blk.10.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 93: blk.10.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 94: blk.10.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 95: blk.10.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 96: blk.10.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 97: blk.10.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 98: blk.10.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 99: blk.10.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 100: blk.11.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 101: blk.11.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 102: blk.11.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 103: blk.11.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 104: blk.11.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 105: blk.11.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 106: blk.11.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 107: blk.11.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 108: blk.11.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 109: blk.12.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 110: blk.12.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 111: blk.12.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 112: blk.12.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 113: blk.12.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 114: blk.12.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 115: blk.12.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 116: blk.12.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 117: blk.12.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 118: blk.13.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 119: blk.13.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 120: blk.13.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 121: blk.13.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 122: blk.13.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 123: blk.13.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 124: blk.13.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 127: blk.14.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 128: blk.14.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 129: blk.14.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 130: blk.14.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 131: blk.14.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 132: blk.14.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 133: blk.14.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 134: blk.14.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 135: blk.14.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 136: blk.15.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 137: blk.15.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 138: blk.15.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 139: blk.15.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 140: blk.15.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 141: blk.15.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 142: blk.15.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 143: blk.15.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 144: blk.15.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 145: blk.16.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 146: blk.16.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 147: blk.16.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 148: blk.16.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 149: blk.16.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 150: blk.16.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 151: blk.16.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 152: blk.16.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 153: blk.16.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 154: blk.17.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 155: blk.17.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 156: blk.17.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 157: blk.17.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 158: blk.17.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 159: blk.17.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 160: blk.17.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 161: blk.17.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 162: blk.17.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 163: blk.18.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 164: blk.18.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 165: blk.18.attn_v.weight q6_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 166: blk.18.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 167: blk.18.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 168: blk.18.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 169: blk.18.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 170: blk.18.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 171: blk.18.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 172: blk.19.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 173: blk.19.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 174: blk.19.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 175: blk.19.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 176: blk.19.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 177: blk.19.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 178: blk.19.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 179: blk.19.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 180: blk.19.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 181: blk.20.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 182: blk.20.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 183: blk.20.attn_v.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 184: blk.20.attn_output.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 185: blk.20.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 186: blk.20.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ]
    llama_model_loader: - tensor 187: blk.20.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ]
    llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 189: blk.20.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
    llama_model_loader: - tensor 190: blk.21.attn_q.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: - tensor 191: blk.21.attn_k.weight q4_K [ 4096, 4096, 1, 1 ]
    llama_model_loader: -
tensor 192: blk.21.attn_v.weight q6_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 193: blk.21.attn_output.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 194: blk.21.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 195: blk.21.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 196: blk.21.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ] llama_model_loader: - tensor 197: blk.21.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 198: blk.21.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 199: blk.22.attn_q.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 200: blk.22.attn_k.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 201: blk.22.attn_v.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 202: blk.22.attn_output.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 203: blk.22.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 204: blk.22.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 205: blk.22.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ] llama_model_loader: - tensor 206: blk.22.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 207: blk.22.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 208: blk.23.attn_q.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 209: blk.23.attn_k.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 210: blk.23.attn_v.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 211: blk.23.attn_output.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 212: blk.23.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 213: blk.23.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 214: blk.23.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ] llama_model_loader: - tensor 215: blk.23.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 216: 
blk.23.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 217: blk.24.attn_q.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 218: blk.24.attn_k.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 219: blk.24.attn_v.weight q6_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 220: blk.24.attn_output.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 221: blk.24.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 222: blk.24.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 223: blk.24.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ] llama_model_loader: - tensor 224: blk.24.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 225: blk.24.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 226: blk.25.attn_q.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 227: blk.25.attn_k.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 228: blk.25.attn_v.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 229: blk.25.attn_output.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 230: blk.25.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 231: blk.25.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 232: blk.25.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ] llama_model_loader: - tensor 233: blk.25.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 234: blk.25.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 235: blk.26.attn_q.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 236: blk.26.attn_k.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 237: blk.26.attn_v.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 238: blk.26.attn_output.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 239: blk.26.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 240: blk.26.ffn_up.weight q4_K [ 
4096, 11008, 1, 1 ] llama_model_loader: - tensor 241: blk.26.ffn_down.weight q4_K [ 11008, 4096, 1, 1 ] llama_model_loader: - tensor 242: blk.26.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 243: blk.26.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 244: blk.27.attn_q.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 245: blk.27.attn_k.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 246: blk.27.attn_v.weight q6_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 247: blk.27.attn_output.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 248: blk.27.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 249: blk.27.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 250: blk.27.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ] llama_model_loader: - tensor 251: blk.27.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 252: blk.27.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 253: blk.28.attn_q.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 254: blk.28.attn_k.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 255: blk.28.attn_v.weight q6_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 256: blk.28.attn_output.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 257: blk.28.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 258: blk.28.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 259: blk.28.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ] llama_model_loader: - tensor 260: blk.28.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 261: blk.28.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 262: blk.29.attn_q.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 263: blk.29.attn_k.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 264: blk.29.attn_v.weight q6_K [ 4096, 4096, 1, 1 ] 
llama_model_loader: - tensor 265: blk.29.attn_output.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 266: blk.29.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 267: blk.29.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 268: blk.29.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ] llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 271: blk.30.attn_q.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 272: blk.30.attn_k.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 273: blk.30.attn_v.weight q6_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 274: blk.30.attn_output.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 275: blk.30.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 276: blk.30.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 277: blk.30.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ] llama_model_loader: - tensor 278: blk.30.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 279: blk.30.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 280: blk.31.attn_q.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 281: blk.31.attn_k.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 282: blk.31.attn_v.weight q6_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 283: blk.31.attn_output.weight q4_K [ 4096, 4096, 1, 1 ] llama_model_loader: - tensor 284: blk.31.ffn_gate.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 285: blk.31.ffn_up.weight q4_K [ 4096, 11008, 1, 1 ] llama_model_loader: - tensor 286: blk.31.ffn_down.weight q6_K [ 11008, 4096, 1, 1 ] llama_model_loader: - tensor 287: blk.31.attn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 288: blk.31.ffn_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 
289: output_norm.weight f32 [ 4096, 1, 1, 1 ] llama_model_loader: - tensor 290: output.weight q6_K [ 4096, 32000, 1, 1 ] llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: general.file_type u32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly Q4_K - Medium
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.10 MB
llm_load_tensors: mem required = 3891.34 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1024.00 MB
llama_new_context_with_model: compute buffer total size = 162.13 MB

prompt: 'describe the image'

nlpcat commented 1 year ago

Is there any solution for this? I've found that models using ALiBi all seem to hit this issue on NVIDIA GPUs, while they run successfully on Metal.

ggerganov commented 1 year ago

As a temporary workaround, you can add `LLAMA_CUDA_MMV_Y=4` to your build and it should work on master. See the discussion in https://github.com/ggerganov/llama.cpp/issues/3740#issuecomment-1783125187. We still need a proper fix, though.
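For anyone unsure where the flag goes: it is a compile-time setting, so the project has to be rebuilt with it. A minimal sketch of both build paths, assuming a CUDA-capable toolchain and the llama.cpp source tree from around this time (the exact flag names are per the workaround above; `LLAMA_CUBLAS` was the switch that enabled the CUDA backend in that era):

```shell
# Option 1: Makefile build (from the repository root)
make clean
make LLAMA_CUBLAS=1 LLAMA_CUDA_MMV_Y=4

# Option 2: CMake build
mkdir -p build && cd build
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_MMV_Y=4
cmake --build . --config Release
```

After rebuilding, run the model with `-ngl` as before; the assertion at `ggml-cuda.cu:6115` should no longer trigger with this setting.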

slaren commented 12 months ago

Fixed in #3921