Closed coder543 closed 2 months ago
I reproduced this in main:
./main -m ~/Meta-Llama-3-8B-Instruct-IQ3_M.gguf -s 7 -i --color --penalize-nl -e --temp 0 -r "<|eot_id|>" --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n"
I think the result is related to the way the llama.cpp tokenizer chunks 3333+777; see here:
buffer: '3333+777?'
input tokens: [ ... '33':1644, '33':1644, '+':10, '777':15831, ... ]
It works as expected with 3,333 + 777 (comma added).
4333+777 (no comma) also worked as expected, so it's not a GGUF problem.
Likely related to BPE token merging behavior: https://github.com/ggerganov/llama.cpp/issues/6809
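For context on where the split happens: Llama 3 uses a GPT-4-style pre-tokenizer regex that, among other rules, splits a run of digits into chunks of at most three, whereas the GPT-2-style default keeps a whole digit run in one chunk for BPE to merge. A simplified, ASCII-only Python sketch (the real pre-tokenizer patterns are considerably more complex, covering Unicode categories, contractions, whitespace, etc.):

```python
import re

def split_digits_llama3_style(text):
    # Llama 3 / GPT-4 style: a run of digits is split into chunks of at most 3
    return re.findall(r"\d{1,3}|\D+", text)

def split_digits_gpt2_style(text):
    # GPT-2 style default: a whole digit run stays one chunk for BPE to merge
    return re.findall(r"\d+|\D+", text)

print(split_digits_llama3_style("3333+777?"))  # ['333', '3', '+', '777', '?']
print(split_digits_gpt2_style("3333+777?"))    # ['3333', '+', '777', '?']
```

Under the Llama 3 rule, "3333" never reaches the merge step as a single chunk, which is why the correct tokenization starts from '333' and '3'.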
Same on my side, with Bartowski's fresh conversion from https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/tree/main (Q8_0 quant).
I did these tests as well and can confirm your findings. Exllama v2 also did better in my own test I mentioned in the discussion yesterday.
Fixed by https://github.com/ggerganov/llama.cpp/pull/6920. Log with the PR applied:
buffer: '3333+777?'
input tokens: [ ..., '333':8765, '3':18, '+':10, '777':15831, ...]
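The difference between the two logs is consistent with the pre-tokenization step: once "3333" reaches the BPE merge loop as a single chunk, rank-ordered merging can only produce '33','33', while the pre-split '333','3' stays intact. A toy rank-based merge loop illustrating this (the merge table and ranks are illustrative only, not the real Llama 3 vocabulary):

```python
# Toy tiktoken-style BPE: repeatedly merge the adjacent pair with the
# lowest rank (leftmost on ties) until no mergeable pair remains.
MERGE_RANKS = {("3", "3"): 0, ("33", "3"): 1}  # illustrative ranks

def bpe(pieces):
    pieces = list(pieces)
    while len(pieces) > 1:
        best = min(range(len(pieces) - 1),
                   key=lambda i: MERGE_RANKS.get((pieces[i], pieces[i + 1]),
                                                 float("inf")))
        if (pieces[best], pieces[best + 1]) not in MERGE_RANKS:
            break  # no adjacent pair has a merge rule
        pieces[best:best + 2] = [pieces[best] + pieces[best + 1]]
    return pieces

print(bpe("3333"))             # ['33', '33'] -- unsplit chunk, the buggy path
print(bpe("333") + bpe("3"))   # ['333', '3'] -- after digit splitting
```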
@Jeximo some other tests you can try, taken from the JS library sample reference:
// Simple test case
test("grabbed", [59312, 2788])
// Naive implementation produces inconsistent tokenization for " grabbed"
test(" grabbed", [30418])
// Naive implementation (in LLAMA 1 TOKENIZER) uses incorrect merge order for multiple consecutive space merges
test(" grabbed", [1881, 30418])
// Linebreak and tab handling
test("\n", [198])
test(" \n", [720])
test(" tabs out here", [3324, 3518, 573, 14294, 1618])
// Equal prio merges are performed left-to-right
test("ax\n####\nboo", [710, 198, 71050, 34093])
// UTF-8 multipoint character that should be found in vocabulary
test('镇', [104643])
// UTF-8 multipoint character that should NOT be found in vocabulary
test('🦙', [9468, 99, 247])
// Consecutive UTF-8 multipoint characters that are NOT found in a vocabulary and use different number of bytes
test('🦙Ꙋ', [9468, 99, 247, 166, 247, 232])
test('Ꙋ🦙', [166, 247, 232, 9468, 99, 247])
// Test input from official LLaMA 3 library
test('This is a test sentence.', [2028, 374, 264, 1296, 11914, 13])
Note regarding the test cases above: I wrote those test cases for LLaMA 1 tokenization and updated the "expected results" for LLaMA 3 (with the exception of the last case, which is copied from the official repo). So, those tests are definitely better than nothing, but they are not "optimal test cases" for uncovering potential bugs in LLaMA 3 tokenization. It would be good to find some adversarial inputs specific to LLaMA 3 tokenization and update the tests accordingly.
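The test(...) calls above can be wrapped in a small harness. A minimal Python sketch, where the tokenize callable is a stand-in for whatever tokenizer implementation is under test (the demo below uses a dummy character-to-code-point tokenizer, not the Llama 3 tokenizer):

```python
def run_tests(tokenize, cases):
    """Run (text, expected_ids) cases against a tokenize(text) -> list[int] callable."""
    failures = []
    for text, expected in cases:
        got = tokenize(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures

# Demo with a dummy tokenizer; with a real Llama 3 tokenizer you would
# plug in the (text, token-id) cases listed above instead.
dummy_cases = [("ab", [97, 98]), ("a", [97])]
failures = run_tests(lambda s: [ord(c) for c in s], dummy_cases)
print(failures)  # []
```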
So I just tested the current Llama 3 models and, despite the changes to support Llama 3 in the latest Ollama release, the official llama3 model that is uploaded still gives the wrong result:
What is 3333 + 777?
Let me calculate that for you!
33,333 + 777 = 34,110
This is with llama3:70b-instruct-q4_K_M (5338d7c58d8d).
@nkeilar what is “the official one”? If you mean ollama… ollama isn’t official, and also they won’t have regenerated their GGUFs yet. Wherever you’re looking, when was the file generated? The changes made require the GGUFs to be generated again. It isn’t just a change to llama.cpp.
I haven’t tried to test the changes, but the discussion around the changes sounded convincing to me.
Someone here apparently had good luck with a new GGUF of the 8B model.
llama.cpp + 8B IQ3_M
@Jeximo is that a 3-bit quantization? 🤔
Regardless, it looks correct.
After pulling the latest llama.cpp repo, and the latest Llama 3 8B Instruct HF repo, converting and quantizing, I can confirm that the issues are indeed fixed:
fixed in #6920
Try this query: "What is 3333+777?"
Yes, yes, LLMs are bad at math. That's not what I'm getting at. Someone mentioned this on Reddit, and I have to agree that I'm seeing weird stuff too.
Let's get a baseline. Here is what meta.ai yields:
This is likely running on Llama 3 70B.
Here is what Groq yields:
and at 8B:
Now, here's where things get weird. Using Open WebUI on top of Ollama, let's use llama.cpp to run the GGUFs of Llama 3.
First, 8B at fp16:
Then 8B at Q8_0:
Then 70B at Q4_0:
I think the problem should be clear. All of the non-llama.cpp instances that were not using GGUFs did the math problem correctly. All of the llama.cpp instances got the problem wrong in exactly the same way. This issue is extremely repeatable on both ends. I have never seen the cloud instances make this mistake, and I have never seen the llama.cpp instances not make this exact mistake of adding an extra digit to the problem and then getting it wrong.
To me, it appears that something is degrading the accuracy of Llama 3 when run under llama.cpp.
Any ideas of what's going wrong here?