woodx9 closed this issue 4 months ago.
As a suggestion, try adding --vocab-type bpe. IIRC, I had to do that for deepseek-coder models.
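For the command in the issue, that would be something like (output name is just an example):
python llama.cpp/convert.py codes-hf --outfile codes-1b.gguf --outtype q8_0 --vocab-type bpe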
I tried that, but I don't think it's the right choice; they are two different tokenization approaches after all. And I have a question about how to tokenize them with --no-vocab.
Deepseek model support is in progress:
I have read all of them, thank you! I will see what I can do!
Looking forward to your work!
I am trying to convert https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct using llama.cpp/convert.py. I get the following error when trying to convert, similar to the deepseek-coder error mentioned above. I am not able to fix this error; can anyone help?
command: python llama.cpp/convert.py llama3-8b --outfile llama3-8b-8k-f16.gguf --outtype f16
output:
Loading model file llama3-8b/model-00001-of-00004.safetensors
Loading model file llama3-8b/model-00001-of-00004.safetensors
Loading model file llama3-8b/model-00002-of-00004.safetensors
Loading model file llama3-8b/model-00003-of-00004.safetensors
Loading model file llama3-8b/model-00004-of-00004.safetensors
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('llama3-8b'))
Traceback (most recent call last):
File "/Users/charlespaulson/2024/llama_cpp/llama.cpp/convert.py", line 1548, in
Add --vocab-type bpe to the command line. That should fix it. This also applies when using convert.py to convert the Meta-distributed Llama 3 files.
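i.e., for the command above, something like:
python llama.cpp/convert.py llama3-8b --outfile llama3-8b-8k-f16.gguf --outtype f16 --vocab-type bpe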
Oh, it's been a while, but I found it!
python convert.py local/models/deepseek-ai/deepseek-coder-6.7b-instruct --vocab-type hfft --pad-vocab
This is the original command I used. You need to use the --vocab-type and --pad-vocab options. I forgot exactly why; it was related to PR #3633. You can read the rationale for it here.
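My rough understanding of --pad-vocab (a sketch of the idea only, not the actual convert.py code): when the checkpoint declares a larger n_vocab than the tokenizer actually provides, the missing slots are filled with dummy tokens so the vocab lines up with the embedding size.
# Illustrative sketch only; the names and dummy-token format are made up, not convert.py internals.
declared_n_vocab = 32256                               # n_vocab from the deepseek-coder params
real_tokens = [f"token{i}" for i in range(32000)]      # stand-in for the tokens the tokenizer actually ships
missing = declared_n_vocab - len(real_tokens)
padded = real_tokens + [f"<dummy{i:05}>" for i in range(missing)]
assert len(padded) == declared_n_vocab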
The Meta distributed Llama3 files are currently unsupported. I've been working on it all day today to see if I can figure it out.
22:47:15 | /mnt/valerie/remote/ggerganov/llama.cpp
(.venv) git:(master | θ) λ python convert.py /mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct --vocab-type bpe
Loading model file /mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct/consolidated.00.pth
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('/mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct'))
Traceback (most recent call last):
File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1555, in <module>
main()
File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1522, in main
vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1424, in load_vocab
vocab = self._create_vocab_by_path(vocab_types)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/valerie/remote/ggerganov/llama.cpp/convert.py", line 1414, in _create_vocab_by_path
raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']
I have no idea what model format Meta used, and that's the part I'm stuck on right now. torchtext also seems to use binary formats, not plaintext BPE formats, so I'm stuck at the moment.
22:55:32 | ~/Local/vocab-model
(.venv) λ bpython
bpython version 0.24 on top of Python 3.11.8 /home/austin/Local/vocab-model/.venv/bin/python
>>> tokenizer_model_path = "/mnt/scsm/models/facebook/llama-3/Meta-Llama-3-8B/tokenizer.model"
>>> tokenizer_model = open(tokenizer_model_path)
>>> vocab = [line.split() for line in tokenizer_model.readlines()]
>>> len(vocab)
128000
>>> vocab[0]
['IQ==', '0']
>>> # This is kind of funny and apropos for how I'm feeling rn, lol
I have a couple ideas, but if anyone knows how to go about this, I'm all ears.
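For what it's worth, each line looks like a base64-encoded token followed by its rank (tiktoken's plaintext format, if I'm reading it right). A minimal sketch to decode the first few entries under that assumption:
import base64

tokenizer_model_path = "/mnt/scsm/models/facebook/llama-3/Meta-Llama-3-8B/tokenizer.model"
with open(tokenizer_model_path) as f:
    entries = [line.split() for line in f if line.strip()]

# Each entry should be (base64 token bytes, rank); decode a handful to inspect them.
for b64_token, rank in entries[:5]:
    print(int(rank), base64.b64decode(b64_token))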
Does hfft fit with the way the Deepseek tokenizer works? I doubt it. Can you give a reason, please?
@woodx9 I didn't create it so you'll need to read the linked rationale.
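One way to sanity-check it yourself (my suggestion, assuming the hfft path just reuses the Hugging Face fast tokenizer via transformers): load the model's tokenizer directly and round-trip some text.
from transformers import AutoTokenizer

# Assumes `transformers` is installed and the repo id below is reachable.
tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct")
ids = tok.encode("def hello():\n    print('hi')")
print(ids)
print(tok.decode(ids))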
This issue was closed because it has been inactive for 14 days since being marked as stale.
I am trying to convert deepseek-ai/deepseek-coder-1.3b-base using llama.cpp/convert.py with
Command
python llama.cpp/convert.py codes-hf \
  --outfile codes-1b.gguf \
  --outtype q8_0
Output:
Loading model file codes-hf/pytorch_model.bin
params = Params(n_vocab=32256, n_embd=2048, n_layer=24, n_ctx=16384, n_ff=5504, n_head=16, n_head_kv=16, n_experts=None, n_experts_used=None, f_norm_eps=1e-06, rope_scaling_type=<RopeScalingType.LINEAR: 'linear'>, f_rope_freq_base=100000, f_rope_scale=4.0, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyQ8_0: 7>, path_model=PosixPath('codes-hf'))
Traceback (most recent call last):
File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1548, in
main()
File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1515, in main
vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1417, in load_vocab
vocab = self._create_vocab_by_path(vocab_types)
File "/home/woodx/Workspace/llamacpp/llama.cpp/convert.py", line 1407, in _create_vocab_by_path
raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']
the "tokenizer_class": "LlamaTokenizerFast", is there a way to support it?