ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Inconsistency while parsing the model using `llama-cli` and `gguf-py` #9893

Open Lyutoon opened 2 days ago

Lyutoon commented 2 days ago

What happened?

Hi, recently I've been learning the gguf-py library. I wrote a script that uses gguf-py to create a GGUF file, but when I tried to load that file with llama-cli, it reported a wrong number of tensors. So I'm wondering whether there is some inconsistency between the C++ loader and the Python writer/reader.

Here is my script:

import os
import sys
import random
import string
from pathlib import Path

import numpy as np

# Necessary to load the local gguf package
sys.path.insert(0, str(Path(__file__).parent.parent))

from gguf import GGUFWriter

writer = GGUFWriter('test.gguf', 'llama')

model = 'llama'
token_l = 11        # vocabulary size (number of tokens)
context_len = 123   # llama.context_length
emb_len = 234       # llama.embedding_length
bc = 1              # llama.block_count
ff_len = 345        # llama.feed_forward_length
hc = 10             # llama.attention.head_count
rms_eps = 0.1       # llama.attention.layer_norm_rms_epsilon
tokenizer_model = 'llama'

token_list = random.sample(string.printable, token_l)
writer.add_token_list(token_list)
writer.add_context_length(context_len)
writer.add_embedding_length(emb_len)
writer.add_block_count(bc)
writer.add_feed_forward_length(ff_len)
writer.add_head_count(hc)
writer.add_layer_norm_rms_eps(rms_eps)
writer.add_tokenizer_model(tokenizer_model)

## Here are 18 tensors
writer.add_tensor('token_embd.weight', np.random.uniform(0, 10, [11, 234]), raw_dtype=0)
writer.add_tensor('output_norm.weight', np.random.uniform(0, 10, [234]), raw_dtype=0)
writer.add_tensor('output.weight', np.random.uniform(0, 10, [11, 234]), raw_dtype=0)
writer.add_tensor('rope_freqs.weight', np.random.uniform(0, 10, [11]), raw_dtype=0)
writer.add_tensor('blk.0.attn_norm.weight', np.random.uniform(0, 10, [234]), raw_dtype=0)
writer.add_tensor('blk.0.attn_q.weight', np.random.uniform(0, 10, [230, 234]), raw_dtype=0)
writer.add_tensor('blk.0.attn_k.weight', np.random.uniform(0, 10, [230, 234]), raw_dtype=0)
writer.add_tensor('blk.0.attn_v.weight', np.random.uniform(0, 10, [230, 234]), raw_dtype=0)
writer.add_tensor('blk.0.attn_output.weight', np.random.uniform(0, 10, [234, 230]), raw_dtype=0)
writer.add_tensor('blk.0.attn_rot_embd.weight', np.random.uniform(0, 10, [123]), raw_dtype=0)
writer.add_tensor('blk.0.ffn_gate_inp.weight', np.random.uniform(0, 10, [123]), raw_dtype=0)
writer.add_tensor('blk.0.ffn_norm.weight', np.random.uniform(0, 10, [234]), raw_dtype=0)
writer.add_tensor('blk.0.ffn_gate.weight', np.random.uniform(0, 10, [345, 234]), raw_dtype=0)
writer.add_tensor('blk.0.ffn_down.weight', np.random.uniform(0, 10, [234, 345]), raw_dtype=0)
writer.add_tensor('blk.0.ffn_up.weight', np.random.uniform(0, 10, [345, 234]), raw_dtype=0)
writer.add_tensor('blk.0.ffn_gate_exps.weight', np.random.uniform(0, 10, [123]), raw_dtype=0)
writer.add_tensor('blk.0.ffn_down_exps.weight', np.random.uniform(0, 10, [123]), raw_dtype=0)
writer.add_tensor('blk.0.ffn_up_exps.weight', np.random.uniform(0, 10, [123]), raw_dtype=0)

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()

##### verify the inconsistency #####
os.system('../../llama-cli -m ./test.gguf -p "hello" -n 5 -e')

os.system('python3 reader.py ./test.gguf')

Then you can see output like this:

build: 3909 (11ac9800) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 9 key-value pairs and 18 tensors from ./test.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
", "3", "...loader: - kv   1:                      tokenizer.ggml.tokens arr[str,11]      = ["e", "!", "A", "w", "f", "
llama_model_loader: - kv   2:                       llama.context_length u32              = 123
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 234
llama_model_loader: - kv   4:                          llama.block_count u32              = 1
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 345
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 10
llama_model_loader: - kv   7:     llama.attention.layer_norm_rms_epsilon f32              = 0.100000
llama_model_loader: - kv   8:                       tokenizer.ggml.model str              = llama
llama_model_loader: - type  f32:   18 tensors
llm_load_vocab: SPM vocabulary, but newline token not found: _Map_base::at! Using special_pad_id instead.
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 0
llm_load_vocab: token to piece cache size = 0.0000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 11
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 123
llm_load_print_meta: n_embd           = 234
llm_load_print_meta: n_layer          = 1
llm_load_print_meta: n_head           = 10
llm_load_print_meta: n_head_kv        = 10
llm_load_print_meta: n_rot            = 23
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 23
llm_load_print_meta: n_embd_head_v    = 23
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 230
llm_load_print_meta: n_embd_v_gqa     = 230
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-01
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 345
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 123
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = all F32 (guessed)
llm_load_print_meta: model params     = 463.95 K
llm_load_print_meta: model size       = 1.77 MiB (32.00 BPW) 
llm_load_print_meta: general.name     = n/a
llm_load_print_meta: BOS token        = 1 '!'
llm_load_print_meta: EOS token        = 2 'A'
llm_load_print_meta: UNK token        = 0 'e'
llm_load_print_meta: EOG token        = 2 'A'
llm_load_print_meta: max token length = 1
llm_load_tensors: ggml ctx size =    0.01 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 18, got 13
llama_load_model_from_file: failed to load model
common_init_from_params: failed to load model './test.gguf'
main: error: unable to load model
Key-Value Pairs:
GGUF.version                           : [3]
GGUF.tensor_count                      : [18]
GGUF.kv_count                          : [9]
general.architecture                   : [108 108  97 109  97]
tokenizer.ggml.tokens                  : [101]
llama.context_length                   : [123]
llama.embedding_length                 : [234]
llama.block_count                      : [1]
llama.feed_forward_length              : [345]
llama.attention.head_count             : [10]
llama.attention.layer_norm_rms_epsilon : [0.1]
tokenizer.ggml.model                   : [108 108  97 109  97]
----
Tensors:
Tensor Name                    | Shape: Shape           | Size: Size         | Quantization: Quantization
--------------------------------------------------------------------------------
token_embd.weight              | Shape: 234x11          | Size: 2574         | Quantization: F32
output_norm.weight             | Shape: 234             | Size: 234          | Quantization: F32
output.weight                  | Shape: 234x11          | Size: 2574         | Quantization: F32
rope_freqs.weight              | Shape: 11              | Size: 11           | Quantization: F32
blk.0.attn_norm.weight         | Shape: 234             | Size: 234          | Quantization: F32
blk.0.attn_q.weight            | Shape: 234x230         | Size: 53820        | Quantization: F32
blk.0.attn_k.weight            | Shape: 234x230         | Size: 53820        | Quantization: F32
blk.0.attn_v.weight            | Shape: 234x230         | Size: 53820        | Quantization: F32
blk.0.attn_output.weight       | Shape: 230x234         | Size: 53820        | Quantization: F32
blk.0.attn_rot_embd.weight     | Shape: 123             | Size: 123          | Quantization: F32
blk.0.ffn_gate_inp.weight      | Shape: 123             | Size: 123          | Quantization: F32
blk.0.ffn_norm.weight          | Shape: 234             | Size: 234          | Quantization: F32
blk.0.ffn_gate.weight          | Shape: 234x345         | Size: 80730        | Quantization: F32
blk.0.ffn_down.weight          | Shape: 345x234         | Size: 80730        | Quantization: F32
blk.0.ffn_up.weight            | Shape: 234x345         | Size: 80730        | Quantization: F32
blk.0.ffn_gate_exps.weight     | Shape: 123             | Size: 123          | Quantization: F32
blk.0.ffn_down_exps.weight     | Shape: 123             | Size: 123          | Quantization: F32
blk.0.ffn_up_exps.weight       | Shape: 123             | Size: 123          | Quantization: F32

As we can see, the Python reader lists all 18 tensors, while llama-cli sees from the header that there are 18 tensors but only recognizes 13 of them.

I'm wondering what is going wrong with my script. Could you please help me figure it out? Thanks a lot!

Name and Version

build: 3909 (11ac9800) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

slaren commented 1 day ago

You need to set the expert metadata for llama.cpp to use the expert tensors; otherwise it ignores those tensors, which leads to this error. attn_rot_embd should also be removed.
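
For reference, a minimal sketch of what those additions could look like in the script above, replacing the corresponding 1D `*_exps` and `ffn_gate_inp` tensors. It assumes `add_expert_count` / `add_expert_used_count` are the relevant GGUFWriter calls (writing `llama.expert_count` / `llama.expert_used_count`) and that each expert tensor is stored as a single 3D tensor with a leading per-expert dimension; the sizes are purely illustrative, not values known to load successfully.

n_expert = 4       # illustrative expert count (assumed value)
n_expert_used = 2  # illustrative experts used per token (assumed value)

# Expert metadata that llama.cpp needs before it will map the *_exps tensors
writer.add_expert_count(n_expert)            # llama.expert_count
writer.add_expert_used_count(n_expert_used)  # llama.expert_used_count

# Expert tensors stacked along a leading per-expert dimension (shapes assumed):
writer.add_tensor('blk.0.ffn_gate_inp.weight',
                  np.random.uniform(0, 10, [n_expert, emb_len]), raw_dtype=0)
writer.add_tensor('blk.0.ffn_gate_exps.weight',
                  np.random.uniform(0, 10, [n_expert, ff_len, emb_len]), raw_dtype=0)
writer.add_tensor('blk.0.ffn_down_exps.weight',
                  np.random.uniform(0, 10, [n_expert, emb_len, ff_len]), raw_dtype=0)
writer.add_tensor('blk.0.ffn_up_exps.weight',
                  np.random.uniform(0, 10, [n_expert, ff_len, emb_len]), raw_dtype=0)

# ... and drop the blk.0.attn_rot_embd.weight tensor from the script entirely.

The exact shapes llama.cpp expects come from its own tensor-mapping code, so treat the dimensions above as a starting point rather than a specification.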

Lyutoon commented 1 day ago

Oh, thanks for the reply! How can I set this expert metadata with gguf-py?

slaren commented 1 day ago

Look into the way convert_hf_to_gguf.py does it.
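
(For anyone following along, the pattern to look for there is roughly: read the expert hyperparameters from the HF config and forward them to the writer. The sketch below is a paraphrase under that assumption, using typical Mixtral-style config key names; it is not code copied from convert_hf_to_gguf.py.)

# Illustrative converter-side sketch (assumed config keys, not verbatim code):
hparams = {"num_local_experts": 8, "num_experts_per_tok": 2}
if (n_experts := hparams.get("num_local_experts")) is not None:
    writer.add_expert_count(n_experts)
    writer.add_expert_used_count(hparams["num_experts_per_tok"])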

Lyutoon commented 1 day ago

Thanks! I’ll have a look and try!