ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Error converting fine-tuned Llama2 7B model: Exception: Vocab size mismatch (model has 32000, but ../jarvis-hf/tokenizer.model has 32001). #6111

Closed FotieMConstant closed 5 months ago

FotieMConstant commented 7 months ago

Hi everyone, I have been stuck for days using llama.cpp to convert a fine-tuned model and then quantize it; I can't get past the conversion phase. When I use the command:

python llama.cpp/convert.py ../jarvis-hf --outtype f16 --outfile converted.bin

Here is the error I get:

Writing converted.bin, format 1
Traceback (most recent call last):
  File "/Users/🤓/jarvis/ollama/llm/llama.cpp/convert.py", line 1466, in <module>
    main()
  File "/Users/🤓/jarvis/ollama/llm/llama.cpp/convert.py", line 1460, in main
    OutputFile.write_all(outfile, ftype, params, model, vocab, special_vocab,
  File "/Users/🤓/jarvis/ollama/llm/llama.cpp/convert.py", line 1117, in write_all
    check_vocab_size(params, vocab, pad_vocab=pad_vocab)
  File "/Users/🤓/jarvis/ollama/llm/llama.cpp/convert.py", line 963, in check_vocab_size
    raise Exception(msg)
Exception: Vocab size mismatch (model has 32000, but ../jarvis-hf/tokenizer.model has 32001).

Now, I am new to this whole fine-tuning thing and I am a little lost as to what the issue might be here :( I will add my Jupyter notebook code below, as well as a working version of the model; the model is on Hugging Face.

Fine-tuning code: https://colab.research.google.com/drive/1FTt_Z1eGOsl2VgPVb8pnM4yUTczhSutM?usp=sharing
Working model: https://colab.research.google.com/drive/19ZuropXXc2_jMC_qxqa8MO4mHHxOqxxe?usp=sharing
Model on Hugging Face: https://huggingface.co/fotiecodes/Llama-2-7b-chat-jarvis
Original pre-trained model: https://huggingface.co/NousResearch/Llama-2-7b-chat-hf

To reproduce the issue: download Llama-2-7b-chat-jarvis from Hugging Face and try to convert it with convert.py from llama.cpp.

A few things to note: when I print the tokenizer's get_vocab() size, it gives me 32001, so I'm not sure why the conversion fails.

OS: macOS Sonoma 14.4 on an Apple M1 chip. llama.cpp: latest.
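For context, the check that fails in convert.py boils down to comparing the model's embedding-table size against the number of entries the tokenizer defines. A simplified sketch of that logic (illustrative only, not the actual convert.py code; the real script handles several vocab types and padding cases):

```python
def check_vocab_size(model_n_vocab: int, tokenizer_n_vocab: int) -> None:
    """Raise if the model's embedding rows don't match the tokenizer entries.

    Simplified illustration of the check convert.py performs before writing
    the output file.
    """
    if model_n_vocab != tokenizer_n_vocab:
        raise Exception(
            f"Vocab size mismatch (model has {model_n_vocab}, "
            f"but tokenizer.model has {tokenizer_n_vocab})."
        )

# The situation in this issue: the fine-tune added a <pad> token to the
# tokenizer (32001 entries) without resizing the model's embeddings (32000),
# so check_vocab_size(32000, 32001) raises the exception quoted above.
```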

Artefact2 commented 7 months ago

Looks like a broken model to me. Blame the author.

I could get a working result with --vocab-type hfft and the patch below. No guarantees, though.

diff --git a/added_tokens.json b/added_tokens.json
index 9c16aa4..0db3279 100644
--- a/added_tokens.json
+++ b/added_tokens.json
@@ -1,3 +1,2 @@
 {
-  "<pad>": 32000
 }
diff --git a/tokenizer.json b/tokenizer.json
index ab74d1c..4afc6a4 100644
--- a/tokenizer.json
+++ b/tokenizer.json
@@ -29,15 +29,6 @@
       "rstrip": false,
       "normalized": true,
       "special": true
-    },
-    {
-      "id": 32000,
-      "content": "<pad>",
-      "single_word": false,
-      "lstrip": false,
-      "rstrip": false,
-      "normalized": true,
-      "special": false
     }
   ],
   "normalizer": {
FotieMConstant commented 7 months ago

Hey @Artefact2, thanks for the heads-up, I'll try that. However, to be safer, do you think it would be better to get and use the official base model from Meta?

FotieMConstant commented 7 months ago

I'm back with some feedback, @Artefact2. I just tried it and it works like a charm, thanks. However, do you think it would be better to request access to the original model from Meta? Could that help?

petergreis commented 6 months ago

So other than modifying the tokenizer.json file, is there another way to fix this? I am working with ChatMusician (based on llama2 7b) and seeing the exact same error...
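One alternative to editing tokenizer.json is to fix the mismatch on the model side before converting: grow the embedding matrix so it has one row per tokenizer entry, then save the model and re-run convert.py. With transformers that is `model.resize_token_embeddings(len(tokenizer))`. Conceptually, resizing appends new rows to the embedding matrix; a NumPy sketch of that idea (illustrative only, not the transformers implementation, which initializes new rows according to the model's config):

```python
import numpy as np

def pad_embeddings(emb: np.ndarray, target_vocab: int) -> np.ndarray:
    """Append rows to an embedding matrix so it has target_vocab rows.

    New rows are initialized to the mean of the existing rows; this is a
    rough sketch of what growing a model's vocab looks like, not the
    actual transformers resize logic.
    """
    n_vocab, dim = emb.shape
    if target_vocab <= n_vocab:
        return emb
    new_rows = np.tile(emb.mean(axis=0), (target_vocab - n_vocab, 1))
    return np.vstack([emb, new_rows])

# Toy example: a "model" with 4 token embeddings grown to match a
# 5-token tokenizer (like 32000 -> 32001 in this issue, in miniature).
emb = np.arange(8, dtype=np.float32).reshape(4, 2)
padded = pad_embeddings(emb, 5)
```

Note that a row padded this way is untrained, so this only makes sense for tokens (like a pad token) that the model is never expected to produce meaningfully.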

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.