Sakura4036 opened 1 month ago
Does this also happen when using only llama.cpp code?
Yes. I tried to run the example command:
llama-cli -m my_model.gguf -p "I believe the meaning of life is" -n 128
and got this error:
...
llm_load_vocab: special tokens cache size = 457
llm_load_vocab: token to piece cache size = 0.5532 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = internlm2
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 92550
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 7.74 B
llm_load_print_meta: model size = 14.41 GiB (16.00 BPW)
llm_load_print_meta: general.name = InternLM2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 92542 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 92542 '<|im_end|>'
llm_load_print_meta: max token length = 384
llm_load_tensors: ggml ctx size = 0.14 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' has wrong shape; expected 4096, 92550, got 4096, 92544, 1, 1
llama_load_model_from_file: failed to load model
...
When I do the same SFT training with qwen2-7b, both LLaMA-Factory and llama.cpp work fine, and the converted GGUF model runs in ollama. So I think this is a bug specific to the InternLM2 model that needs a check and a fix. @JohannesGaessler
llm_load_print_meta: n_vocab = 92550
INFO:hf-to-gguf:output.weight, torch.bfloat16 --> F16, shape = {4096, 92544}
@Sakura4036 The vocab size does not match the tensor size. Try to modify the vocab_size field in config.json to make it match, then re-convert the model.
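As a quick sanity check before re-converting, you can compare vocab_size in config.json against the row count of the token embedding tensor recorded in the safetensors header; this reads only the header, not the weights. A minimal sketch — the tensor name model.tok_embeddings.weight is an assumption for InternLM2, so check your shard's header for the actual name (and for sharded models, open the shard that holds the embedding):

```python
import json
import struct

def read_safetensors_header(path):
    # A .safetensors file starts with an 8-byte little-endian header length,
    # followed by that many bytes of JSON metadata describing each tensor.
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

def check_vocab(config_path, weights_path,
                embd_name="model.tok_embeddings.weight"):
    # Compare config.json's vocab_size with the embedding tensor's row count.
    with open(config_path) as f:
        vocab_size = json.load(f)["vocab_size"]
    header = read_safetensors_header(weights_path)
    embd_rows = header[embd_name]["shape"][0]  # HF layout: [vocab, hidden]
    if vocab_size != embd_rows:
        print(f"mismatch: vocab_size={vocab_size}, embedding rows={embd_rows}")
    return vocab_size, embd_rows
```

On the model in this issue this would report vocab_size=92550 against 92544 embedding rows, which is exactly the mismatch check_tensor_dims complains about.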
I tried to modify the vocab_size field in config.json from 92544 to 92550 and re-converted the model with convert_hf_to_gguf.py, but got an error:
Traceback (most recent call last):
File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 3583, in <module>
main()
File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 3567, in main
model_instance.set_vocab()
File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 2129, in set_vocab
piece = tokenizer.IdToPiece(token_id)
File "/home/anaconda3/envs/llama/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1179, in _batched_func
return _func(self, arg)
File "/home/anaconda3/envs/llama/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1172, in _func
raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.
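For what it's worth, that traceback is just an out-of-range lookup: after raising vocab_size to 92550, the converter asks the SentencePiece model for pieces it never defined, since its own vocabulary ends at id 92543. A stand-in sketch of the failure mode, with plain integers instead of a real tokenizer:

```python
# The SentencePiece .model file defines pieces for ids 0..92543 only,
# while the edited config.json now claims 92550 tokens.
sp_vocab_size = 92544      # what tokenizer.vocab_size() returns on the real model
config_vocab_size = 92550  # vocab_size after editing config.json

# Ids the converter will ask for that the tokenizer cannot answer;
# tokenizer.IdToPiece(i) raises IndexError('piece id is out of range.')
# for each of these.
missing_ids = [i for i in range(config_vocab_size) if i >= sp_vocab_size]
print(missing_ids)  # [92544, 92545, 92546, 92547, 92548, 92549]
```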
I meant to set it to 92544, to match the tensor size, but from what you say it was already that?
n_vocab comes from the number of tokens here:
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,92550] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
So I would have guessed that setting vocab_size to 92544 to match the {4096, 92544}-sized tensor would have helped.
@Sakura4036 Do you happen to have an added_tokens.json file in the same directory as the model? This seems like the only other thing besides the vocab_size field that could affect the resulting vocab size.
Yes, when vocab_size is 92544 (and it was originally), convert_hf_to_gguf.py doesn't report an error, but the resulting GGUF model doesn't work, i.e. it fails with the error I showed at the beginning.
Yes, an added_tokens.json file does exist in the exported model folder. Should I delete it?
{
"[UNUSED_TOKEN_141]": 92544,
"[UNUSED_TOKEN_142]": 92545,
"[UNUSED_TOKEN_143]": 92546,
"[UNUSED_TOKEN_144]": 92547,
"[UNUSED_TOKEN_145]": 92548,
"[UNUSED_TOKEN_146]": 92549
}
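Those six entries are exactly the gap between the two numbers: the checkpoint's embedding has 92544 rows, and the added tokens push the converter's token count to 92550. A two-line check, with the numbers taken from the logs above:

```python
# The six extra entries from added_tokens.json, as shown in this thread.
added_tokens = {
    "[UNUSED_TOKEN_141]": 92544,
    "[UNUSED_TOKEN_142]": 92545,
    "[UNUSED_TOKEN_143]": 92546,
    "[UNUSED_TOKEN_144]": 92547,
    "[UNUSED_TOKEN_145]": 92548,
    "[UNUSED_TOKEN_146]": 92549,
}
embd_rows = 92544  # rows of token_embd.weight / output.weight
n_vocab = embd_rows + len(added_tokens)
print(n_vocab)  # 92550, the n_vocab llama.cpp reports at load time
```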
Yes, you can delete it (or rename the file to something else). These unused tokens don't map to anything in the model (according to the tensor sizes), and this is what makes n_vocab bigger than it should be.
But these tokens also exist in the tokenizer.json and tokenizer_config.json files, which causes an error after deleting the added_tokens.json file:
Traceback (most recent call last):
File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 3583, in <module>
main()
File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 3567, in main
model_instance.set_vocab()
File "/home/github/llama.cpp/convert_hf_to_gguf.py", line 2179, in set_vocab
if toktypes[token_id] != SentencePieceTokenTypes.UNKNOWN:
IndexError: list index out of range
I wonder if this is a bug in the LLaMA-Factory export for the InternLM2 model.
I deleted the added_tokens.json file and removed the six added tokens from tokenizer.json and tokenizer_config.json in the model export folder. After that, I re-converted the exported model to GGUF format and imported it into ollama, which ran it successfully. But the model's performance is far from what it was before LLaMA-Factory merged the adapter and exported the model.
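If you'd rather script that cleanup than edit the files by hand, something like the following removes the six unused ids from both tokenizer files. The field names follow the usual Hugging Face layout (added_tokens in tokenizer.json, added_tokens_decoder in tokenizer_config.json); verify they exist in your particular export before running, and keep backups:

```python
import json

UNUSED_IDS = set(range(92544, 92550))  # the six [UNUSED_TOKEN_14x] ids

def strip_added_tokens(tokenizer_json, tokenizer_config_json):
    # Drop the unused entries from tokenizer.json's added_tokens list.
    with open(tokenizer_json) as f:
        tok = json.load(f)
    tok["added_tokens"] = [t for t in tok.get("added_tokens", [])
                           if t["id"] not in UNUSED_IDS]
    with open(tokenizer_json, "w") as f:
        json.dump(tok, f, ensure_ascii=False, indent=2)

    # Same for tokenizer_config.json's added_tokens_decoder mapping,
    # which is keyed by the token id as a string.
    with open(tokenizer_config_json) as f:
        cfg = json.load(f)
    cfg["added_tokens_decoder"] = {
        k: v for k, v in cfg.get("added_tokens_decoder", {}).items()
        if int(k) not in UNUSED_IDS
    }
    with open(tokenizer_config_json, "w") as f:
        json.dump(cfg, f, ensure_ascii=False, indent=2)
```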
Same problem.
I'm having the same issue after I fine-tuned gemma-2-2b-it and tried to convert its LoRA...
@vansinhu same problem here. I fine-tuned the Internlm2_5-20b-chat model with xtuner and converted it to GGUF with llama.cpp, then ran it with ollama, and got the same problem. This issue blocks my full workflow of trying to use InternLM.
I'm getting this when attempting to convert the base Internlm2_5-20b, without finetuning.
Removing the tokens works for now. I haven't tested performance, but I don't see any theoretical reason why there would be a gap.
@Sakura4036 I wonder if your performance gap is because you're finetuning with the additional tokens being used as ChatML tokens, such that removing them results in normal text tokenization, which mismatches your training tokenization.
@euclaise Why does this error occur if you just convert the base model without fine-tuning it?
What happened?
I fine-tuned the InternLM2 7b-chat model in LLaMA-Factory using a custom dataset and LoRA, exported the safetensors model, converted it to GGUF format using the convert_hf_to_gguf.py script, and finally imported it into ollama to run it, and ollama reported an error:
python convert_hf_to_gguf log
quantize q4_0 log
Name and Version
llama.cpp source code version: b549a1bbefb2f1fbb8b558bac1f2ae7967e60964
What operating system are you seeing the problem on?
Linux
Relevant log output