Closed coder543 closed 2 months ago
I reproduced this in main:
./main -m ~/Meta-Llama-3-8B-Instruct-IQ3_M.gguf -s 7 -i --color --penalize-nl -e --temp 0 -r "<|eot_id|>" --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n"
I think the result is related to the way the llama.cpp tokenizer chunks 3333+777; see here:
buffer: '3333+777?'
input tokens: [ ... '33':1644, '33':1644, '+':10, '777':15831, ... ]
It works as expected with 3,333 + 777 (comma added).
4333+777 (no comma) also worked as expected, so it's not a GGUF problem.
Likely related to BPE token merging behavior: https://github.com/ggerganov/llama.cpp/issues/6809
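For context on where the split happens: Llama 3 uses a GPT-4-style pre-tokenizer regex that, among other rules, splits a run of digits into chunks of at most three, whereas the GPT-2-style default keeps a whole digit run in one chunk for BPE to merge. A simplified, ASCII-only Python sketch (the real pre-tokenizer patterns are considerably more complex, covering Unicode categories, contractions, whitespace, etc.):

```python
import re

def split_digits_llama3_style(text):
    # Llama 3 / GPT-4 style: a run of digits is split into chunks of at most 3
    return re.findall(r"\d{1,3}|\D+", text)

def split_digits_gpt2_style(text):
    # GPT-2 style default: a whole digit run stays one chunk for BPE to merge
    return re.findall(r"\d+|\D+", text)

print(split_digits_llama3_style("3333+777?"))  # ['333', '3', '+', '777', '?']
print(split_digits_gpt2_style("3333+777?"))    # ['3333', '+', '777', '?']
```

Under the Llama 3 rule, "3333" never reaches the merge step as a single chunk, which is why the correct tokenization starts from '333' and '3'.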
Same on my side, with Bartowski's fresh conversion from https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/tree/main (Q8_0 quant).
I did these tests as well and can confirm your findings. Exllama v2 also did better in my own test I mentioned in the discussion yesterday.
Fixed by https://github.com/ggerganov/llama.cpp/pull/6920. Log with the PR applied:
buffer: '3333+777?'
input tokens: [ ..., '333':8765, '3':18, '+':10, '777':15831, ...]
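The difference between the two logs is consistent with the pre-tokenization step: once "3333" reaches the BPE merge loop as a single chunk, rank-ordered merging can only produce '33','33', while the pre-split '333','3' stays intact. A toy rank-based merge loop illustrating this (the merge table and ranks are illustrative only, not the real Llama 3 vocabulary):

```python
# Toy tiktoken-style BPE: repeatedly merge the adjacent pair with the
# lowest rank (leftmost on ties) until no mergeable pair remains.
MERGE_RANKS = {("3", "3"): 0, ("33", "3"): 1}  # illustrative ranks

def bpe(pieces):
    pieces = list(pieces)
    while len(pieces) > 1:
        best = min(range(len(pieces) - 1),
                   key=lambda i: MERGE_RANKS.get((pieces[i], pieces[i + 1]),
                                                 float("inf")))
        if (pieces[best], pieces[best + 1]) not in MERGE_RANKS:
            break  # no adjacent pair has a merge rule
        pieces[best:best + 2] = [pieces[best] + pieces[best + 1]]
    return pieces

print(bpe("3333"))             # ['33', '33'] -- unsplit chunk, the buggy path
print(bpe("333") + bpe("3"))   # ['333', '3'] -- after digit splitting
```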
@Jeximo some other tests you can try, taken from the JS library sample reference:
// Simple test case
test("grabbed", [59312, 2788])
// Naive implementation produces inconsistent tokenization for " grabbed"
test(" grabbed", [30418])
// Naive implementation (in LLAMA 1 TOKENIZER) uses incorrect merge order for multiple consecutive space merges
test(" grabbed", [1881, 30418])
// Linebreak and tab handling
test("\n", [198])
test(" \n", [720])
test(" tabs out here", [3324, 3518, 573, 14294, 1618])
// Equal prio merges are performed left-to-right
test("ax\n####\nboo", [710, 198, 71050, 34093])
// UTF-8 multipoint character that should be found in vocabulary
test('镇', [104643])
// UTF-8 multipoint character that should NOT be found in vocabulary
test('🦙', [9468, 99, 247])
// Consecutive UTF-8 multipoint characters that are NOT found in a vocabulary and use different number of bytes
test('🦙Ꙋ', [9468, 99, 247, 166, 247, 232])
test('Ꙋ🦙', [166, 247, 232, 9468, 99, 247])
// Test input from official LLaMA 3 library
test('This is a test sentence.', [2028, 374, 264, 1296, 11914, 13])
Note regarding the test cases above: I wrote those test cases for LLaMA 1 tokenization and updated the "expected results" for LLaMA 3 (with the exception of the last case, which is copied from the official repo). So, those tests are definitely better than nothing, but they are not "optimal test cases" for uncovering potential bugs in LLaMA 3 tokenization. It would be good to find some adversarial inputs specific to LLaMA 3 tokenization and update the tests accordingly.
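The test(...) calls above can be wrapped in a small harness. A minimal Python sketch, where the tokenize callable is a stand-in for whatever tokenizer implementation is under test (the demo below uses a dummy character-to-code-point tokenizer, not the Llama 3 tokenizer):

```python
def run_tests(tokenize, cases):
    """Run (text, expected_ids) cases against a tokenize(text) -> list[int] callable."""
    failures = []
    for text, expected in cases:
        got = tokenize(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures

# Demo with a dummy tokenizer; with a real Llama 3 tokenizer you would
# plug in the (text, token-id) cases listed above instead.
dummy_cases = [("ab", [97, 98]), ("a", [97])]
failures = run_tests(lambda s: [ord(c) for c in s], dummy_cases)
print(failures)  # []
```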
So I just tested the current Llama 3 models and, despite the changes to support Llama 3 in the latest Ollama release, the official llama3 model that is uploaded still gives the wrong result:
What is 3333 + 777?
Let me calculate that for you!
33,333 + 777 = 34,110
This is with llama3:70b-instruct-q4_K_M (5338d7c58d8d).
@nkeilar what is “the official one”? If you mean ollama… ollama isn’t official, and also they won’t have regenerated their GGUFs yet. Wherever you’re looking, when was the file generated? The changes made require the GGUFs to be generated again. It isn’t just a change to llama.cpp.
I haven’t tried to test the changes, but the discussion around the changes sounded convincing to me.
Someone here apparently had good luck with a new GGUF of the 8B model.
llama.cpp + 8B IQ3_M
@Jeximo is that a 3-bit quantization? 🤔
Regardless, it looks correct.
After pulling the latest llama.cpp repo, and the latest Llama 3 8B Instruct HF repo, converting and quantizing, I can confirm that the issues are indeed fixed:
fixed in #6920
Try this query: "What is 3333+777?"
Yes, yes, LLMs are bad at math. That's not what I'm getting at. Someone mentioned this on Reddit, and I have to agree that I'm seeing weird stuff too.
Let's get a baseline. Here is what meta.ai yields:
This is likely running on Llama 3 70B.
Here is what Groq yields:
and at 8B:
Now, here's where things get weird. Using Open WebUI on top of Ollama, let's use llama.cpp to run the GGUFs of Llama 3.
First, 8B at fp16:
Then 8B at Q8_0:
Then 70B at Q4_0:
I think the problem should be clear. All of the non-llama.cpp instances that were not using GGUFs did the math problem correctly. All of the llama.cpp instances got the problem wrong in exactly the same way. This issue is extremely repeatable on both ends. I have never seen the cloud instances make this mistake, and I have never seen the llama.cpp instances not make this exact mistake of adding an extra digit to the problem and then getting it wrong.
To me, it appears that something is degrading the accuracy of Llama 3 when run under llama.cpp.
Any ideas of what's going wrong here?