Looks like an old style (i.e. using slow tokenizers) model to me.
Edit: funny, didn't find a mention of merges.txt in the repository. What are we fighting against?
@TheBloke Assuming you have merges.txt which looks like:

#version: blah
a b
d ef
etc etc

and a tokenizer.json that lacks a merges section, you can try this little script I made:
import json

# Load the existing tokenizer.json
with open('tokenizer.json', 'r') as fp:
    tokenizer = json.load(fp)

# Read the merges from merges.txt, skipping the "#version:" header and blank lines
merges = []
with open('merges.txt', 'r') as mfp:
    firstline = next(mfp).strip()
    if not firstline.startswith('#version:'):
        merges.append(firstline)
    for l in mfp:
        l = l.strip()
        if len(l) > 0:
            merges.append(l)

# Add the merges and write the result to a new file, leaving the original untouched
tokenizer['merges'] = merges
with open('tokenizer.json.new', 'w') as outfp:
    json.dump(tokenizer, outfp, indent = 4)
It'll open tokenizer.json and merges.txt in the current directory, and then add the merges to the stuff in that tokenizer.json. The result will get saved to tokenizer.json.new in the current directory - you can verify if it looks right. The format looks pretty simple at least with the random model I checked. I don't have a way to test it, but I think this should work.
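If you want a quick sanity check, something like this compares what landed in tokenizer.json.new against merges.txt (again, just a sketch):

import json

# Count the merges that ended up in the patched file versus the usable lines in merges.txt.
with open('tokenizer.json.new') as fp:
    patched = json.load(fp)

with open('merges.txt') as mfp:
    lines = [l.strip() for l in mfp]
expected = [l for l in lines if l and not l.startswith('#version:')]

print('merges in tokenizer.json.new:', len(patched['merges']))
print('usable lines in merges.txt  :', len(expected))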
@KerfuffleV2 It works, but the merges.txt for both models looks like it's damaged or incomplete; the last line in that file is not a pair.
I don't have a HF account so I can't look at it myself. I guess TB could try just trimming that last line then? Or change if len(l) > 0: to if len(l) > 0 and ' ' in l: to make it just skip lines that don't have at least one space (something like the tweak sketched below).
From what I recall, before GGUF we didn't even add the merges at all so it'll probably be okay. What are the odds that one merge is the super important one? (With my luck...)
Quick edit: Even if it seems to work, it's probably a bad idea to leave it as is though. I assume that if it doesn't just crash/detect an error, then it's going to work like "blah" merges with the empty string, which might actually have an effect.
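Concretely, the tweak to the loop in the script above would look something like this (untested, same caveats as the original script):

for l in mfp:
    l = l.strip()
    if len(l) > 0 and ' ' in l:
        # only keep lines that actually look like a merge pair
        merges.append(l)
    elif len(l) > 0:
        # e.g. the truncated last line: report it instead of silently merging with ''
        print('Skipping malformed merge line:', repr(l))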
The tokenizer is the same as the Qwen models: they use tiktoken, and this GPT2FastTokenizer is converted from their vocab.
Their C++ tiktoken implementation: https://github.com/QwenLM/qwen.cpp/tree/master/tiktoken_cpp
And their tiktoken vocab: https://huggingface.co/Qwen/Qwen-7B/blob/main/qwen.tiktoken
The converted GPT2-style tokenizer is from: https://huggingface.co/JosephusCheung/Qwen-LLaMAfied-7B-Chat/tree/main
But I am still confused, what makes it different from those working BPE tokenized models?
For the purposes of converting to GGUF in BPE mode, the difference is that it (apparently) doesn't have the merges in a tokenizer section in tokenizer.json. We currently only look there and don't consider external sources like merges.txt at all.
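Roughly speaking, the lookup is something like this (just an illustration of where we look, not the actual convert.py code):

import json
from pathlib import Path

def load_merges(model_dir):
    # Illustration only: merges are taken from tokenizer.json; merges.txt is never read.
    # The exact key depends on the tokenizer.json layout.
    tok = json.loads((Path(model_dir) / 'tokenizer.json').read_text())
    merges = tok.get('model', {}).get('merges') or tok.get('merges')
    if not merges:
        raise ValueError('no merges found in tokenizer.json (merges.txt is not consulted)')
    return merges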
Also, the original Qwen, as far as I know, isn't included in the category of "already working BPE tokenized models"; there are still some issues open requesting Qwen support. So after fixing/working around this issue there may well be more to deal with.
Please test #3743 and see if you can create a functional model. You'll need to use --padvocab to add the dummy tokens.
Testing now, thanks. Love --padvocab, that's awesome thanks!
Working, thank you! Great work.
...........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 313.13 MB
llama_new_context_with_model: VRAM scratch buffer: 307.00 MB
llama_new_context_with_model: total VRAM used: 307.00 MB (model: 0.00 MB, context: 307.00 MB)
system_info: n_threads = 15 / 30 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nWrite a story about llamas<|im_end|>\n<|im_start|>assistant:Once upon a time, in the highlands of South America, there lived a group of llamas. These gentle creatures had thick fur to protect them from the cold mountain winds and were well adapted to life at high altitudes.
The llamas lived on a small farm owned by a kind old man named Pedro. Pedro loved his llamas dearly and took great care of them. He would feed them fresh grass every day, clean their pens, and take good care of their health.
One day, Pedro decided to enter his llamas into a local llama show. He worked hard with his llamas for weeks, training them to walk in formation and perform tricks. Finally, the big day arrived, and the llamas were ready to shine.
The day of the show was bright and sunny. The llamas walked proudly in their colorful blankets, following Pedro as he led them around the ring. The audience watched in awe as the llamas performed their tricks, from weaving in and out between each other to lying down on command.
At the end of the show, Pedro was overjoyed when his llamas were awarded first place. From that day on, Pedro and his llamas became famous throughout the region for their incredible skills and beauty.
Years passed, and Pedro grew older. He knew it was time to pass on his love for llamas to the next generation. So he decided to start a llama sanctuary, where people could come and learn about these amazing creatures.
Pedro's llamas continued to live long and happy lives, teaching others about the importance of caring for animals and preserving their habitats. And Pedro's legacy lived on through the love and care that his llamas brought to everyone who met them.<|endoftext|> [end of text]
re-uploading 14B and 7B quants now
7B and 14B quants are tested and re-uploaded
Can't use with text generation webui currently. llama-cpp-python may need an upgrade.
...
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q5_1: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
ERROR: byte not found in vocab: '
'
fish: Job 1, 'python server.py --api --listen…' terminated by signal SIGSEGV (Address boundary error)
If this model is to be supported, can we have a tokenizer test, please?
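A tokenizer test would need reference tokenizations to compare llama.cpp against; here is a rough, hypothetical sketch of generating that reference data with the HF tokenizer (repo id and test strings are placeholders):

from transformers import AutoTokenizer

# Hypothetical sketch: dump reference token ids from the HF tokenizer so llama.cpp's
# BPE tokenizer output can be checked against them. Repo id and strings are placeholders.
tok = AutoTokenizer.from_pretrained('CausalLM/14B')
for text in ['Hello world', '  leading spaces', 'emoji and some Chinese: 中文']:
    print(repr(text), tok.encode(text, add_special_tokens=False))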
I get an error on a CUDA GPU:
CUDA error 9 at /home/artem/Research/llm/llama/llama.cpp/ggml-cuda.cu:6862: invalid configuration argument
current device: 0
Not when the prompt is processed, but when processing the first message.
CUDA GPU Error Fixed for me: https://github.com/ggerganov/llama.cpp/issues/3740#issuecomment-1783125187
7B and 14B quants are tested and re-uploaded
Hello, how did you make a 14B gguf file that works properly? I used [python "D:\llama.cpp\convert.py" "D:\14B" --padvocab], but the converted 14B file could not answer correctly; its answers are confused and it outputs scrambled text. The same thing happens with [CausalLM/14B-DPO-alpha] and [CausalLM/8x7B-MoE-test-NOT-MIXTRAL]. My system is Win11, running in cmd. [TheBloke/CausalLM-14B-GGUF] was working normally.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Hi guys
A couple of new and interesting models dropped today: CausalLM 7B and 14B.
These are a merge of Qwen + Llama in Llama architecture, but with a vocab.json + merges.txt GPT2 tokenizer, with a vocab size exceeding 150,000.
I was able to make an FP16 with two extra steps:
1. Added <dummyXXX> tokens to added_tokens.json.
2. Ran: ./convert.py --vocabtype bpe --outtype fp16 /path/to/causallm_14b/source /path/to/gguf/causallm_14b.fp16.gguf
This seemed to produce a valid FP16, from which I made quants as normal. For 14B I could only make old-style quants, as many of the tensors are not 256-divisible. For 7B I could make k-quants.
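For reference, step 1 above (padding added_tokens.json with the <dummyXXX> tokens) can be scripted; this is only a rough sketch, and the target vocab size and token naming below are placeholders rather than the exact values used:

import json

TARGET_VOCAB_SIZE = 152064   # placeholder: use the model's actual embedding/vocab size

with open('added_tokens.json') as fp:
    added = json.load(fp)    # maps token string -> token id

# Assumes the highest existing token id is in added_tokens.json; adjust the starting
# point if the base vocab already extends further than the added tokens.
next_id = max(added.values()) + 1
while next_id < TARGET_VOCAB_SIZE:
    added['<dummy' + str(next_id) + '>'] = next_id
    next_id += 1

with open('added_tokens.json', 'w') as fp:
    json.dump(added, fp, indent=4)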
Unfortunately, the resulting files are not usable with llama.cpp, giving this error:
Did I do anything wrong? Or is this a bug?
Full log of attempting to run inference on one of the 7B k-quants: