camAtGitHub opened this issue 3 months ago
Not just Windows --> all Linux flavors too:
expected 292, got 291 for all Llama 3.1 variants.
Reports say llama.cpp has fixed this upstream.. can anyone confirm?
See also https://github.com/SciSharp/LLamaSharp/pull/874
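One quick way to confirm the upstream fix is to build current llama.cpp and try the same GGUF with llama-cli. A rough sketch (build steps per the llama.cpp README at the time, so they may have changed):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j
# load the same model that fails under llamafile
./llama-cli -m /path/to/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -p "Hello" -n 16

If llama-cli loads the model cleanly, the fix just hasn't landed in a llamafile release yet.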
I'm having the same issue with Meta-Llama-3.1-8B-Instruct-Q8_0.gguf on Linux with llamafile 0.8.12.
Same with Meta Llama 3.1 Instruct 8B Q4_K_M GGUF, on ROCm.
Similar results with Llama-3.1-8b-Pruned-7-Layers.Q8_0.gguf, albeit with 'expected 220, got 219'.
This issue is probably fixed as of commit e9ee3f93c8abb6156a7f67a75c90af7a834d738d
Could someone please provide step-by-step instructions to resolve this issue? I'm encountering the same problem and would appreciate guidance on how to fix it.
Wait for the next binary release if you don't want to compile it yourself
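If you do want to build it yourself, the steps in the llamafile README are roughly the following (a sketch; if I remember right the Makefile bootstraps the cosmocc toolchain it needs, but double-check the README):

git clone https://github.com/Mozilla-Ocho/llamafile
cd llamafile
make -j8
sudo make install PREFIX=/usr/local
# confirm you're now on a build newer than 0.8.12
llamafile --version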
I tried building it locally from the current master, but it isn't working:
import_cuda_impl: initializing gpu module...
FLAG_nocompile 0
FLAG_recompile 0
link_cuda_dso: note: dynamically linking /.llamafile/v/0.8.12/ggml-cuda.so
ggml_cuda_link: welcome to CUDA SDK with cuBLAS
link_cuda_dso: GPU support loaded
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2855,"msg":"build info","tid":"11541504","timestamp":1723341134}
{"function":"server_cli","level":"INFO","line":2858,"msg":"system info","n_threads":12,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11541504","timestamp":1723341134,"total_threads":24}
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from Meta-Llama-3.1-8B-Instruct-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 18
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q6_K: 226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q6_K
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 6.14 GiB (6.56 BPW)
llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.32 MiB
error: Uncaught SIGSEGV (SI_KERNEL) at 0 on blg31 pid 116844 tid 116844
./llamafile/bin/llamafile
No error information
Linux Cosmopolitan 3.6.2 MODE=x86_64; #1 SMP PREEMPT_DYNAMIC Mon Aug 5 16:07:06 EDT 2024 blg31 6.6.41-gentoo-dist
RAX 74612e302e6b6c62 RBX 00007f69a014d240 RDI 00007f69a014d240
RCX 00007f69a014d240 RDX 00007f69a011e020 RSI 00000000811925f0
RBP 00007ffca8f126e0 RSP 00007ffca8f125a0 RIP 00007f69dc78cb7f
R8 0000000000000000 R9 0000000155502000 R10 0000000000001000
R11 0000000000000001 R12 00007f69a014d240 R13 00007f669a000000
R14 00007f669a004000 R15 0000000000000000
TLS 0000000000ae9dc0
XMM0 000000008119145000007f669a000000 XMM8 00007f69d3a2301800007f69d3a23020
XMM1 00000000000000000000000081191450 XMM9 00007f69d3a2302800007f69d3a23030
XMM2 00007f69dc78ba2000007f69dc7502e0 XMM10 00007f69d3a2303800007f69d3a23040
XMM3 00007f69dc78cb1000007f69dc7502f0 XMM11 00007f69d3a2304800007f69d3a23050
XMM4 00007f69dc78c89000007f69dc78c9d0 XMM12 00007f69d3a2305800007f69d3a23060
XMM5 00007f69dc78c73000007f69dc78b830 XMM13 00007f69d3a2306800007f69d3a23070
XMM6 000000000000000000007f69a221eff8 XMM14 00007f69d3a2307800007f69d3a23080
XMM7 000000000000000000007f69a221eff8 XMM15 00000000000000000000000000000000
cosmoaddr2line /part/01/Tmp/kumargau/llamafile/promptslr/llamafile/bin/llamafile 7f69dc78cb7f 5359e4 536695 69d772 704969 7052cc 5a16ab 4ad163 4f8e90 401d40 433272 4015f4
0x00007f69dc78cb7f: ?? ??:0
0x00000000005359e4: ?? ??:0
0x0000000000536695: ?? ??:0
0x000000000069d772: ?? ??:0
0x0000000000704969: ?? ??:0
0x00000000007052cc: ?? ??:0
0x00000000005a16ab: ?? ??:0
0x00000000004ad163: ?? ??:0
0x00000000004f8e90: ?? ??:0
0x0000000000401d40: ?? ??:0
0x0000000000433272: ?? ??:0
0x00000000004015f4: ?? ??:0
000000400000-000000adb1f8 r-x-- 7020kb
000000adc000-00000319a000 rw--- 39mb
0006fe000000-0006fe001000 rw-pa 4096b
7f67f0c00000-7f6979e713c0 r--s- 6290mb
7f69a011d000-7f69a017d000 rw-pa 384kb
7f69a173f000-7f69a4fdf000 rw-pa 57mb
7f69b0005000-7f69b09a5000 rw-pa 10mb
7f69d3c30000-7f69d3e00000 rw-pa 1856kb
7f69d406f000-7f69d415f000 rw-pa 960kb
7f69d5dd0000-7f69d5e00000 rw-pa 192kb
7f69dc5ca000-7f69dc63a000 rw-pa 448kb
7f69dcd91000-7f69dce3225a r--s- 645kb
7f69dce58000-7f69dce88000 rw-pa 192kb
7f69dcee5000-7f69dcfdc3e8 rw-pa 989kb
7f69dcfdd000-7f69dd0de000 rw-pa 1028kb
# 6'728'548'352 bytes in 16 mappings
./llamafile/bin/llamafile -m Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --nobrowser --host somehost --gpu nvidia --n-predict 512
Segmentation fault (core dumped)
This issue has been fixed as of release v0.8.13.
@camAtGitHub want to close the issue now that it's fixed?
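For anyone else hitting this, upgrading should just be a matter of downloading the new binary (a sketch; the asset URL follows the usual releases-page pattern, so verify it there):

# grab the fixed release and make it executable
curl -LO https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.13/llamafile-0.8.13
chmod +x llamafile-0.8.13
./llamafile-0.8.13 -m Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --nobrowser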
Contact Details
github
What happened?
I came here to report the issue / bug / my incompetence around the error of:
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
The logs are below, but in particular, trying to load external weights (on Windows) for
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
and its variant
DarkIdol-Llama-3.1-8B-Instruct-1.2-Uncensored.Q4_K_M.gguf
results in the above error about 'wrong number of tensors'.
After some digging I found that LM Studio v0.2.29 is required, so I went looking for changes between LM Studio 0.2.28 and 0.2.29. At a pure guess, it seems that Llama 3.1 rope scaling was introduced / patched / fixed there, which may be the reason for the error, as llamafile doesn't have this yet? (Again, a guess on my part.)
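If you want to check what your file actually contains, the gguf Python package from the llama.cpp repo ships a gguf-dump script that prints the metadata and tensor count. A quick sketch (assuming the pip package still provides the gguf-dump entry point and its --no-tensors flag):

pip install gguf
# print only the metadata; look at the llama.rope.* keys and the reported tensor count
gguf-dump --no-tensors Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

If the rope-scaling guess is right, newer Llama 3.1 conversions carry an extra rope-frequency tensor that older loaders don't create, which would line up with 'expected 292, got 291'.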
Anyway, to reproduce the error it should be easy enough; just try:
llamafile-0.8.12.exe -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
(sha1sum ee8c490b5390f3d85e59b2b2c61d83157ce5df73) and see if you get the same error.
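To make sure we're comparing the same file first:

# verify the download matches the hash quoted above
sha1sum Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
# expected: ee8c490b5390f3d85e59b2b2c61d83157ce5df73

Additional info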
llamafile-0.8.12.exe -m mistral-7b-instruct-v0.1.Q4_K_M.gguf
works fine. The Llama 3.1 model doesn't load with llamafile but does when using llama-cli.
TL;DR
Meta-Llama-3.1-8B-Instruct and variants aren't running with llamafile, failing with the error llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291. I believe the issue is related to changes in the GGUF file type/model that have been introduced.

Version
llamafile v0.8.12
What operating system are you seeing the problem on?
Windows
Relevant log output