ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: After converting the InternLM2 7b from LLamaFactory and importing it into ollama, I get an error: tensor 'token_embd.weight' has wrong shape. #8445

Open Sakura4036 opened 1 month ago

Sakura4036 commented 1 month ago

What happened?

I fine-tuned the InternLM2 7b-chat model in LLamaFactory with a custom dataset and LoRA, exported the safetensors model, converted it to GGUF format using the convert_hf_to_gguf.py script, and finally imported it into ollama to run it. ollama reported this error:

Error: llama runner process has terminated: signal: aborted (core dumped) error loading model: check_tensor_dims: tensor 'token_embd.weight' has wrong shape; expected  4096, 92550, got  4096, 92544,     1,     1
llama_load_model_from_file: exception loading model

convert_hf_to_gguf.py log

INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:Set model tokenizer
WARNING:hf-to-gguf:InternLM2 convert token 'b'\x00'' to '🐉'!
WARNING:hf-to-gguf:Replace eos:2 with a special token:92542 in chat mode so that the conversation can end normally.
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 92542
INFO:gguf.vocab:Setting special token type unk to 0
INFO:gguf.vocab:Setting special token type pad to 2
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab:Setting add_eos_token to False
...
INFO:hf-to-gguf:output_norm.weight,          torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:output.weight,               torch.bfloat16 --> F16, shape = {4096, 92544}
...

quantize q4_0 log

...
llama_model_loader: - kv   0:                       general.architecture str              = internlm2
llama_model_loader: - kv   1:                               general.name str              = InternLM2
llama_model_loader: - kv   2:                   internlm2.context_length u32              = 32768
llama_model_loader: - kv   3:                      internlm2.block_count u32              = 32
llama_model_loader: - kv   4:                 internlm2.embedding_length u32              = 4096
llama_model_loader: - kv   5:              internlm2.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                   internlm2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   7:             internlm2.attention.head_count u32              = 32
llama_model_loader: - kv   8: internlm2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv   9:          internlm2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,92550]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,92550]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,92550]   = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 92542
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {{ '<s>' }}{% if messages[0]['role'] ...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
...

Name and Version

llama.cpp source code version: b549a1bbefb2f1fbb8b558bac1f2ae7967e60964

What operating system are you seeing the problem on?

Linux

Relevant log output

### convert_hf_to_gguf

...
INFO:hf-to-gguf:blk.31.attn_output.weight,   torch.bfloat16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.31.attn_q.weight,        torch.bfloat16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.31.attn_k.weight,        torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.31.attn_v.weight,        torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.31.attn_norm.weight,     torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.31.ffn_gate.weight,      torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.31.ffn_down.weight,      torch.bfloat16 --> F16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.31.ffn_up.weight,        torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.31.ffn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:output_norm.weight,          torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:output.weight,               torch.bfloat16 --> F16, shape = {4096, 92544}
...
JohannesGaessler commented 1 month ago

Does this also happen when using only llama.cpp code?

Sakura4036 commented 1 month ago

Does this also happen when using only llama.cpp code?

Yes. I tried running the example command:

llama-cli -m my_model.gguf -p "I believe the meaning of life is" -n 128

and got this error:

...
llm_load_vocab: special tokens cache size = 457
llm_load_vocab: token to piece cache size = 0.5532 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = internlm2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 92550
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 7.74 B
llm_load_print_meta: model size       = 14.41 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = InternLM2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 92542 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 92542 '<|im_end|>'
llm_load_print_meta: max token length = 384
llm_load_tensors: ggml ctx size =    0.14 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' has wrong shape; expected  4096, 92550, got  4096, 92544,     1,     1
llama_load_model_from_file: failed to load model
...
Sakura4036 commented 1 month ago

When I do the same SFT training with Qwen2-7B, LLamaFactory and llama.cpp both work fine, and the converted GGUF model runs in ollama.

So I think this is a bug specific to the InternLM2 model that needs to be checked and fixed. @JohannesGaessler

compilade commented 1 month ago

llm_load_print_meta: n_vocab          = 92550
INFO:hf-to-gguf:output.weight,               torch.bfloat16 --> F16, shape = {4096, 92544}

@Sakura4036 The vocab size does not match the tensor size. Try to modify the vocab_size field in config.json to make it match, then re-convert the model.
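
For reference, a quick way to see this mismatch on the HF side is to compare the tokenizer size against config.json. This is a minimal sketch; the path is a placeholder, and it reads the vocab through AutoTokenizer rather than the converter's own SentencePiece path:

import json
from transformers import AutoTokenizer

model_dir = "path/to/exported-model"  # hypothetical path to the LLamaFactory export

# InternLM2 ships a custom tokenizer class, hence trust_remote_code
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
with open(f"{model_dir}/config.json") as f:
    vocab_size = json.load(f)["vocab_size"]

print("tokenizer size   :", len(tokenizer))  # 92550 in this issue (base vocab + added tokens)
print("config vocab_size:", vocab_size)      # 92544 in this issue (matches the embedding tensor)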

Sakura4036 commented 1 month ago

llm_load_print_meta: n_vocab          = 92550
INFO:hf-to-gguf:output.weight,               torch.bfloat16 --> F16, shape = {4096, 92544}

@Sakura4036 The vocab size does not match the tensor size. Try to modify the vocab_size field in config.json to make it match, then re-convert the model.

I tried modifying the vocab_size field in config.json from 92544 to 92550 and re-converted the model with convert_hf_to_gguf.py, but got an error:

Traceback (most recent call last):
  File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 3583, in <module>
    main()
  File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 3567, in main
    model_instance.set_vocab()
  File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 2129, in set_vocab
    piece = tokenizer.IdToPiece(token_id)
  File "/home/anaconda3/envs/llama/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1179, in _batched_func
    return _func(self, arg)
  File "/home/anaconda3/envs/llama/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1172, in _func
    raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.
compilade commented 1 month ago

I tried to modify the vocab_size field in config.json from 92544 to 92550

I meant to set it to 92544, to match the tensor size, but from what you say it was already that?

n_vocab comes from the number of tokens here:

llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,92550]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...

https://github.com/ggerganov/llama.cpp/blob/5e116e8dd51775f8f1c090570be148d5d7eea6c3/src/llama.cpp#L4653

So I would have guessed that setting vocab_size to 92544 to match the {4096, 92544}-sized tensor would have helped.
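
As for the IndexError: SentencePiece only knows the ids contained in tokenizer.model, so raising vocab_size past that makes the converter's IdToPiece call ask for a piece that doesn't exist. A minimal sketch of that failure mode, with a hypothetical path:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="path/to/tokenizer.model")  # hypothetical path

print(sp.vocab_size())                # number of pieces baked into tokenizer.model
print(sp.IdToPiece(0))                # fine: '<unk>'
print(sp.IdToPiece(sp.vocab_size()))  # raises IndexError: piece id is out of range.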

compilade commented 1 month ago

@Sakura4036 Do you happen to have an added_tokens.json file in the same directory as the model? This seems like the only thing other than the vocab_size field that could affect the resulting vocab size.

Sakura4036 commented 1 month ago

I tried to modify the vocab_size field in config.json from 92544 to 92550

I meant to set it to 92544, to match the tensor size, but from what you say it was already that?

n_vocab comes from the number of tokens here:

llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,92550]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...

https://github.com/ggerganov/llama.cpp/blob/5e116e8dd51775f8f1c090570be148d5d7eea6c3/src/llama.cpp#L4653

So I would have guessed that setting vocab_size to 92544 to match the {4096, 92544}-sized tensor would have helped.

Yes, when vocab_size is 92544 (its original value), convert_hf_to_gguf.py doesn't report an error, but the resulting GGUF model doesn't work, i.e. it produces the error I showed at the beginning.

Sakura4036 commented 1 month ago

@Sakura4036 Other guess: do you happen to have an added_tokens.json file in the same directory as the model? This seems like the only thing other than the vocab_size field that could affect the resulting vocab size.

Yes, an added_tokens.json file does exist in the exported model folder. Should I delete it?

{
  "[UNUSED_TOKEN_141]": 92544,
  "[UNUSED_TOKEN_142]": 92545,
  "[UNUSED_TOKEN_143]": 92546,
  "[UNUSED_TOKEN_144]": 92547,
  "[UNUSED_TOKEN_145]": 92548,
  "[UNUSED_TOKEN_146]": 92549
}
compilade commented 1 month ago

Yes, an added_tokens.json file does exist in the exported model folder. Should I delete it?

Yes, you can delete it (or rename the file to something else). These unused tokens don't map to anything in the model (according to the tensor sizes), and this is what makes n_vocab bigger than it should be.
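
A rough way to confirm that from the export itself (the tensor name and single-file layout are assumptions about the merged InternLM2 export, not something verified in this thread) is to compare the embedding row count with the ids listed in added_tokens.json:

import json
from safetensors import safe_open

model_dir = "path/to/exported-model"  # hypothetical path

with open(f"{model_dir}/added_tokens.json") as f:
    added = json.load(f)

# Assumes a single-file export; a sharded export would need the index file instead
with safe_open(f"{model_dir}/model.safetensors", framework="pt") as f:
    embd = f.get_tensor("model.tok_embeddings.weight")  # assumed InternLM2 embedding name

print("embedding rows :", embd.shape[0])           # 92544 in this issue
print("added token ids:", sorted(added.values()))  # 92544..92549, i.e. past the last row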

Sakura4036 commented 1 month ago

Yes, an added_tokens.json file does exist in the exported model folder. Should I delete it?

Yes, you can delete it (or rename the file to something else). These unused tokens don't map to anything in the model (according to the tensor sizes), and this is what makes n_vocab bigger than it should be.

But these tokens also exist in the tokenizer.json and tokenizer_config.json files, which causes an error after deleting the added_tokens.json file:

Traceback (most recent call last):
  File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 3583, in <module>
    main()
  File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 3567, in main
    model_instance.set_vocab()
  File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 2179, in set_vocab
    if toktypes[token_id] != SentencePieceTokenTypes.UNKNOWN:
IndexError: list index out of range
Sakura4036 commented 1 month ago

I wonder if this is a bug in the LLamaFactory export for the InternLM2 model.

Sakura4036 commented 1 month ago

I deleted the added_tokens.json file and removed the six added tokens from tokenizer.json and tokenizer_config.json in the model export folder. After that, I re-converted the exported model to GGUF format and imported it into ollama, which ran it successfully. But the model's performance is far worse than it was before LLamaFactory merged the adapter and exported the model.
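
For anyone hitting the same thing, a rough sketch of that manual cleanup (it assumes the standard HF tokenizer.json / tokenizer_config.json layout and a hypothetical path; back up the files first):

import json

model_dir = "path/to/exported-model"  # hypothetical path
limit = 92544  # embedding rows; ids at or above this have no weights in the model

with open(f"{model_dir}/tokenizer.json") as f:
    tok = json.load(f)
tok["added_tokens"] = [t for t in tok["added_tokens"] if t["id"] < limit]
with open(f"{model_dir}/tokenizer.json", "w") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)

with open(f"{model_dir}/tokenizer_config.json") as f:
    cfg = json.load(f)
cfg["added_tokens_decoder"] = {
    k: v for k, v in cfg.get("added_tokens_decoder", {}).items() if int(k) < limit
}
with open(f"{model_dir}/tokenizer_config.json", "w") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)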

marcello-sousa commented 1 month ago

Same problem.

SulRash commented 3 weeks ago

I have the same issue after fine-tuning gemma-2-2b-it and trying to convert its LoRA...

Jesean commented 4 days ago

@vansinhu Same problem here. I fine-tuned the Internlm2_5-20b-chat model with xtuner, converted it to GGUF with llama.cpp, then ran it with ollama and got the same error. This issue blocks my whole workflow for trying out InternLM.

euclaise commented 1 day ago

I'm getting this when attempting to convert the base Internlm2_5-20b, without finetuning.

I'm removing the tokens for now. I haven't tested performance, but I don't see any theoretical reason why there would be a gap.

@Sakura4036 I wonder if your performance gap is because you fine-tuned with the additional tokens used as ChatML tokens, so that removing them results in normal text tokenization, which no longer matches your training tokenization.
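
If that's what happened, it should be visible directly in the tokenization. A minimal sketch, where the paths and the assumption that the fine-tune used the [UNUSED_TOKEN_*] strings as chat markers are hypothetical:

from transformers import AutoTokenizer

# Hypothetical paths: one export still has the six added tokens, the other had them removed
with_added = AutoTokenizer.from_pretrained("export-with-added-tokens", trust_remote_code=True)
without_added = AutoTokenizer.from_pretrained("export-without-added-tokens", trust_remote_code=True)

text = "[UNUSED_TOKEN_141]user\nhello[UNUSED_TOKEN_142]"  # assumed fine-tune chat format
print(with_added.encode(text, add_special_tokens=False))     # markers become single ids (92544, 92545)
print(without_added.encode(text, add_special_tokens=False))  # markers get split into ordinary pieces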

Sakura4036 commented 1 day ago

@euclaise Why does this error occur if you just convert the base model without fine-tuning it?