ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Could not find tokenizer.model in llama2 #3256

Closed: muhammadfhadli1453 closed this issue 9 months ago

muhammadfhadli1453 commented 9 months ago

When I ran this command:

python convert.py \
    llama2-summarizer-id-2/final_merged_checkpoint \
    --outtype f16 \
    --outfile llama2-summarizer-id-2/final_merged_checkpoint/llama2-summarizer-id-2.gguf.fp16.bin

I encountered the following error:

Loading model file llama2-summarizer-id-2/final_merged_checkpoint/model-00001-of-00002.safetensors
Loading model file llama2-summarizer-id-2/final_merged_checkpoint/model-00001-of-00002.safetensors
Loading model file llama2-summarizer-id-2/final_merged_checkpoint/model-00002-of-00002.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, f_norm_eps=1e-05, f_rope_freq_base=None, f_rope_scale=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('llama2-summarizer-id-2/final_merged_checkpoint'))
Traceback (most recent call last):
  File "llama.cpp/convert.py", line 1209, in <module>
    main()
  File "llama.cpp/convert.py", line 1191, in main
    vocab = load_vocab(vocab_dir, args.vocabtype)
  File "llama.cpp/convert.py", line 1092, in load_vocab
    raise FileNotFoundError(
FileNotFoundError: Could not find tokenizer.model in llama2-summarizer-id-2/final_merged_checkpoint or its parent; if it's in another directory, pass the directory as --vocab-dir

After fine-tuning the Llama 2 model, I do not have a "tokenizer.model" file. Instead, the model directory contains the following files:

$ ls llama2-summarizer-id-2/final_merged_checkpoint/
config.json             model-00001-of-00002.safetensors  model.safetensors.index.json  tokenizer_config.json
generation_config.json  model-00002-of-00002.safetensors  special_tokens_map.json       tokenizer.json

What should I do to resolve this issue?

Note: I followed this tutorial for fine-tuning: https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/

Green-Sky commented 9 months ago

Just use the original one. If the tokenizer.model is in a different directory, you can pass that directory with the --vocab-dir argument.
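
A minimal sketch of that invocation, reusing the command from the original post and assuming the base model's tokenizer.model sits in a hypothetical llama2-base/ directory:

python convert.py \
    llama2-summarizer-id-2/final_merged_checkpoint \
    --outtype f16 \
    --vocab-dir llama2-base \
    --outfile llama2-summarizer-id-2/final_merged_checkpoint/llama2-summarizer-id-2.gguf.fp16.bin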

muhammadfhadli1453 commented 9 months ago

just use the original one. if the tokenizer.model is in a different directory, you can use the --vocab-dir argument

What do you mean by "the original one"? Can you explain, please?

KerfuffleV2 commented 9 months ago

He means the one from the base model you fine-tuned.

muhammadfhadli1453 commented 9 months ago

He means the one from the base model you fine-tuned.

I see. But I fine-tuned the model on a different language; will it still work?

KerfuffleV2 commented 9 months ago

I fine-tuned the model on a different language; will it still work?

I think it would depend on whether you made changes to the vocabulary in addition to training (like adding tokens, etc.). If it was just training, then I believe it would work. I'm not 100% sure about this, though.
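
One way to check is to compare the two vocabularies directly. A minimal sketch using the transformers library; the base repo id below is an assumption, swap in whichever base model was actually fine-tuned:

from transformers import AutoTokenizer

# Base model (placeholder repo id) and the fine-tuned checkpoint from this thread.
base = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tuned = AutoTokenizer.from_pretrained("llama2-summarizer-id-2/final_merged_checkpoint")

# len() counts added tokens too; differing sizes mean the base tokenizer.model
# no longer matches the fine-tuned model.
print("base vocab:", len(base), "fine-tuned vocab:", len(tuned))
print("identical:", base.get_vocab() == tuned.get_vocab())

If both checks pass, the base model's tokenizer.model should be safe to reuse.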

yximkt commented 8 months ago

Same question here.

shibe2 commented 8 months ago

Same question here.

And the answer was given: the tokenizer model is included in the resulting file, so you need one that matches the model you are trying to convert.

J-Scott-Dav commented 8 months ago

Many (most) of the base models I've seen on Hugging Face do not have a file named tokenizer.model. So I am also having the same issue.

byjlw commented 7 months ago

Same issue

stephenthumb commented 7 months ago

Same issue. The base model also doesn't have a tokenizer.model. Is there a way to get the tokenizer from the Hugging Face AutoTokenizer?
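
One possible route, not confirmed in this thread: when transformers can build the slow SentencePiece tokenizer for the model, save_pretrained will re-export a tokenizer.model file. A sketch, with a placeholder repo id:

from transformers import AutoTokenizer

# use_fast=False requests the slow, SentencePiece-backed tokenizer, which is
# the variant that actually carries a tokenizer.model file.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=False)

# Writes tokenizer.model (plus the tokenizer config files) into the model
# directory, where convert.py expects to find it.
tok.save_pretrained("llama2-summarizer-id-2/final_merged_checkpoint")

This only works when a slow tokenizer class exists for the model; otherwise, download tokenizer.model directly (see further below).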

massudy commented 7 months ago

same issue

J-Scott-Dav commented 7 months ago

I have found a solution for this problem. The default vocab type is 'spm', which invokes a SentencePiece tokenizer. Some models use a Byte-Pair Encoding (BPE) tokenizer instead. To convert a BPE-based model, use this syntax:

python convert.py modelname_or_path --vocabtype bpe

swavkulinski commented 5 months ago

I have found a solution for this problem. The default vocab type is 'spm', which invokes a SentencePiece tokenizer. Some models use a Byte-Pair Encoding (BPE) tokenizer instead. To convert a BPE-based model, use this syntax:

python convert.py modelname_or_path --vocabtype bpe

Note: in current versions of convert.py the flag is spelled --vocab-type.
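
With a recent convert.py, the same invocation would therefore look like this (the model path is a placeholder):

python convert.py modelname_or_path --vocab-type bpe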

beeroutine commented 3 months ago

He means the one from the base model you fine-tuned.

The Llama models downloaded from Meta (all of them) do not include a tokenizer. I have the same issue.

massudy commented 3 months ago

He means the one from the base model you fine-tuned.

The Llama models downloaded from Meta (all of them) do not include a tokenizer. I have the same issue.

Go to Hugging Face, search for the model, download the tokenizer separately, and move it into the folder that is missing it.
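
A minimal sketch of that download step using the huggingface_hub library; the repo id is a placeholder for whichever base model you fine-tuned:

from huggingface_hub import hf_hub_download

# Fetch only tokenizer.model from the base model's Hub repo and drop it next
# to the fine-tuned weights; both paths are illustrative.
hf_hub_download(
    repo_id="meta-llama/Llama-2-7b-hf",
    filename="tokenizer.model",
    local_dir="llama2-summarizer-id-2/final_merged_checkpoint",
)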

jferments commented 2 months ago

I am here with the same problem, trying to convert Llama 3 70B. I don't know what is meant by "go to Hugging Face, search for the model, download the tokenizer separately" ... there is no tokenizer.model on the Llama 3 70B page, and searching for it turns up nothing. Where can I download the tokenizer for this?

rebeccadifrancesco commented 2 months ago

I am here with the same problem, trying to convert Llama 3 70B. I don't know what is meant by "go to Hugging Face, search for the model, download the tokenizer separately" ... there is no tokenizer.model on the Llama 3 70B page, and searching for it turns up nothing. Where can I download the tokenizer for this?

Here: https://huggingface.co/meta-llama/Meta-Llama-3-8B/tree/main/original. Put the tokenizer.model in your model folder and then use --vocab-type bpe as stated above. It worked for me.
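
Putting the two steps together for Llama 3, a sketch assuming you have access to the gated meta-llama repo and a local model folder named my-llama3-model (both names are placeholders); huggingface-cli preserves the original/ subfolder, hence the mv:

huggingface-cli download meta-llama/Meta-Llama-3-8B original/tokenizer.model \
    --local-dir my-llama3-model
mv my-llama3-model/original/tokenizer.model my-llama3-model/
python convert.py my-llama3-model --vocab-type bpe --outtype f16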