VinAIResearch / PhoGPT

PhoGPT: Generative Pre-training for Vietnamese (2023)
Apache License 2.0

Error wrong number of tensors when serving vinai/PhoGPT-4B-Chat with llama.cpp #22

Closed. xtfocus closed this issue 6 months ago.

xtfocus commented 6 months ago

I successfully converted the model to GGUF format using llama.cpp's convert-hf-to-gguf.py script:

cd ~/.models
git clone --progress --verbose https://huggingface.co/vinai/PhoGPT-4B-Chat
cd ~/llama.cpp
python3 convert-hf-to-gguf.py ~/.models/PhoGPT-4B-Chat --outfile ~/.models/pho.gguf

Output

Loading model: PhoGPT-4B-Chat
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer

...

output_norm.bias, n_dims = 1, torch.bfloat16 --> float32
Model successfully exported to '/home/username/.models/pho.gguf'

I then tried inference, and the following error showed up:

cd
./llama.cpp/main -m ./.models/pho.gguf -p "xin chào"

Log start
main: build = 2101 (b7b74cef)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed  = 1707968957
llama_model_loader: loaded meta data with 18 key-value pairs and 388 tensors from ./.models/pho.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mpt
llama_model_loader: - kv   1:                               general.name str              = PhoGPT-4B-Chat
llama_model_loader: - kv   2:                         mpt.context_length u32              = 8192
llama_model_loader: - kv   3:                       mpt.embedding_length u32              = 3072
llama_model_loader: - kv   4:                            mpt.block_count u32              = 32
llama_model_loader: - kv   5:                    mpt.feed_forward_length u32              = 12288
llama_model_loader: - kv   6:                   mpt.attention.head_count u32              = 24
llama_model_loader: - kv   7:           mpt.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   8:               mpt.attention.max_alibi_bias f32              = 8.000000
llama_model_loader: - kv   9:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  10:                      tokenizer.ggml.tokens arr[str,20480]   = ["<unk>", "<s>", "</s>", "<pad>", "!"...
llama_model_loader: - kv  11:                  tokenizer.ggml.token_type arr[i32,20480]   = [3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  12:                      tokenizer.ggml.merges arr[str,20266]   = ["á »", "á º", "Ġ t", "n g", "Ġ...
llama_model_loader: - kv  13:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  14:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  15:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  16:            tokenizer.ggml.padding_token_id u32              = 3
llama_model_loader: - kv  17:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - type  f32:  258 tensors
llama_model_loader: - type  f16:  130 tensors
llm_load_vocab: special tokens definition check successful ( 4/20480 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = mpt
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 20480
llm_load_print_meta: n_merges         = 20266
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 24
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 8.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = all F32 (guessed)
llm_load_print_meta: model params     = 3.75 B
llm_load_print_meta: model size       = 6.99 GiB (16.01 BPW)
llm_load_print_meta: general.name     = PhoGPT-4B-Chat
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 3 '<pad>'
llm_load_print_meta: LF token         = 130 'Ä'
llm_load_tensors: ggml ctx size =    0.15 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 388, got 195
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './.models/pho.gguf'
main: error: unable to load model
nviet commented 6 months ago

That is because PhoGPT uses tensors with bias parameters in addition to weight parameters, and llama.cpp currently does not support MPT-trained models with that feature. I created a fork of llama.cpp here some months back to make it able to run PhoGPT. Perhaps applying the same method will also work with the latest version of llama.cpp.
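
For reference, this mismatch can be checked directly against the exported file by counting tensors whose names end in ".bias" (a minimal sketch, assuming the gguf Python package from llama.cpp's gguf-py is installed and the export path used earlier in this thread):

# Count weight vs. bias tensors in the exported GGUF: the file holds 388
# tensors, but a loader that only creates weight tensors for the MPT
# architecture ends up expecting roughly half of them (195).
# Assumes `pip install gguf` (the gguf-py package shipped with llama.cpp).
from gguf import GGUFReader

reader = GGUFReader("/home/username/.models/pho.gguf")
names = [t.name for t in reader.tensors]

n_bias = sum(name.endswith(".bias") for name in names)
n_other = len(names) - n_bias

print(f"total tensors : {len(names)}")   # 388 in the log above
print(f"bias tensors  : {n_bias}")
print(f"other tensors : {n_other}")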

datquocnguyen commented 6 months ago

Thanks. I will take a look and respond soon.

nntadotzip commented 6 months ago

@xtfocus I faced the same issue. Then I found @nviet's solution on Huggingface, and it worked!

# check out the specific llama-cpp-python commit used for this workaround
cd llama-cpp-python
git checkout e9bc4c4baf3f121a178dec215770ccd0ac86c28e
# swap the bundled llama.cpp in vendor/ for @nviet's patched fork
cd vendor
git clone https://github.com/nviet/llama.cpp
cd ..
# build and install llama-cpp-python against the patched sources
pip install .
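
With the patched build installed, the GGUF exported earlier should load through the llama-cpp-python API as usual (a minimal sketch; the model path and generation parameters are illustrative):

from llama_cpp import Llama

# Load the GGUF exported earlier in this thread; n_ctx matches the 8192-token
# training context reported in the loader output above.
llm = Llama(model_path="/home/username/.models/pho.gguf", n_ctx=8192)

# Same test prompt as the original ./main invocation.
output = llm("Xin chào", max_tokens=64)
print(output["choices"][0]["text"])
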
datquocnguyen commented 6 months ago

Thanks @nviet and @xtfocus. Here is the PR I just made for llama.cpp to address this issue, following @nviet's commit with minor modifications to make the 'bias' parameters optional (i.e., it now works for both MPT and PhoGPT): https://github.com/ggerganov/llama.cpp/pull/5638
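
Conceptually, the change is about tolerating missing tensors: weights stay mandatory, while bias tensors are looked up but allowed to be absent, so one MPT loading path covers both bias-free MPT checkpoints and PhoGPT. A rough Python sketch of that idea (illustrative only, with a hypothetical helper; the actual PR is C++ inside llama.cpp's MPT loader):

# Illustrative sketch of the "optional bias" idea (hypothetical helper,
# not llama.cpp code): weights are required, biases may be absent.
def get_tensor(tensors: dict, name: str, required: bool = True):
    if name in tensors:
        return tensors[name]
    if required:
        raise KeyError(f"missing required tensor: {name}")
    return None  # optional tensor (e.g. a bias) is simply left unset


def load_mpt_layer(tensors: dict, i: int) -> dict:
    """Collect one transformer block's tensors, with biases optional."""
    return {
        "attn_qkv_w": get_tensor(tensors, f"blk.{i}.attn_qkv.weight"),
        "attn_qkv_b": get_tensor(tensors, f"blk.{i}.attn_qkv.bias", required=False),
        "attn_out_w": get_tensor(tensors, f"blk.{i}.attn_output.weight"),
        "attn_out_b": get_tensor(tensors, f"blk.{i}.attn_output.bias", required=False),
    }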

datquocnguyen commented 6 months ago

The PR was merged. I will close this issue.