ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Llama 3.1 might not be fully supported yet #8730

Closed: Azirine closed this issue 1 month ago

Azirine commented 1 month ago

What happened?

Llama 3.1 8B quantized after https://github.com/ggerganov/llama.cpp/pull/8676 fails the "wicks" problem that Llama 3 8B can answer correctly.

Prompt: Making one candle requires 125 grams of wax and 1 wick. How many candles can I make with 500 grams of wax and 3 wicks? Be concise.

Tested with three of the newest quants; all gave the same wrong answer.
https://huggingface.co/legraphista/Meta-Llama-3.1-8B-Instruct-IMat-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct.Q8_0.gguf
https://huggingface.co/bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
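
For reference, the expected answer is the minimum of the two constraints; a quick Python sketch of the arithmetic:

# Each candle needs 125 g of wax and 1 wick; supplies are 500 g and 3 wicks.
wax_limit = 500 // 125              # 4 candles possible from the wax alone
wick_limit = 3 // 1                 # 3 candles possible from the wicks alone
print(min(wax_limit, wick_limit))   # 3 -- the wicks are the limiting factor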

Name and Version

version: 3482 (e54c35e4) built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0

What operating system are you seeing the problem on?

Mac

Relevant log output

./llama-cli -m Meta-Llama-3-8B-Instruct-Q8_0.gguf --no-mmap -fa -c 8192 --temp 0 -if --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

With 500 grams of wax, you can make 500 / 125 = 4 candles. With 3 wicks, you can make 3 candles. The limiting factor is the wicks, so you can make 3 candles.

./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --no-mmap -fa -c 32768 --temp 0 -if --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

To find the number of candles, divide the total wax (500g) by the wax per candle (125g). Then, divide the result by the number of wicks (3) to account for the wick limitation.

500g / 125g = 4 candles
4 candles / 3 wicks = 1.33 candles (round down to 1, as you can't make a fraction of a candle)

You can make 1 candle with 500 grams of wax and 3 wicks.
steampunque commented 1 month ago

Confirming this observed behavior.

candle.txt

NEW Q6_K 3.1 CONVERT WITH ROPE PATCH

bash-5.1$ lm candle.txt 
To find the number of candles, divide the total wax (500g) by the wax per candle (125g). Then, divide the result by the number of wicks (3) since each candle requires 1 wick.

500g / 125g = 4 candles
4 candles / 3 wicks = 1.33 candles (round down to 1, since you can't make a fraction of a candle)

You can make 1 candle with 500g of wax and 3 wicks.

OLD Q6_K 3.1 CONVERT WITHOUT ROPE PATCH

bash-5.1$ lm candle.txt 
To make one candle, you need 125g of wax and 1 wick. You have 500g of wax and 3 wicks. 

Divide the wax by the wax needed per candle: 500g / 125g = 4 candles.

You have 3 wicks, which is not enough to make 4 candles. So, you can make 3 candles.
Dampfinchen commented 1 month ago

It's possible that 3.1 might have some regressions compared to 3.0. You should test 3.1 f16 gguf versus 3.1 at FP16 running with transformers.

Dampfinchen commented 1 month ago

I think special tokens like <|eot_id|> are added by llama.cpp at inference time. If I remove them from the input suffix, it gets the answer right.

./llama-cli -m "Meta-Llama-3.1-8B-Instruct-Q4_K_S.gguf" -fa -c 8192 --temp 0 -if --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|start_header_id|>assistant<|end_header_id|>\n\n"

Making one candle requires 125 grams of wax and 1 wick. How many candles can I make with 500 grams of wax and 3 wicks? Be concise.
<|start_header_id|>assistant<|end_header_id|>

To make one candle, you need 125g of wax and 1 wick. You have 500g of wax and 3 wicks.

You can make 500g / 125g = 4 candles with the wax. However, you only have 3 wicks, so you can only make 3 candles.
Azirine commented 1 month ago

I think special tokens like <|eot_id|> are added by llama.cpp at inference time.

It is not added automatically before the input suffix.

./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --no-mmap -fa -c 32768 --temp 0 -if --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

[1722170104] == Running in interactive mode. ==
[1722170104]  - Press Ctrl+C to interject at any time.
[1722170104]  - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

[1722170104] embd_inp.size(): 1, n_consumed: 0
[1722170104] eval: [ '<|begin_of_text|>':128000 ]
[1722170104] n_past = 1
[1722170104] embd_inp.size(): 1, n_consumed: 1
[1722170104] waiting for user input
[1722170104] appending input prefix: '<|start_header_id|>user<|end_header_id|>

'
[1722170110] appending input suffix: '<|eot_id|><|start_header_id|>assistant<|end_header_id|>

'
[1722170110] buffer: 'Making one candle requires 125 grams of wax and 1 wick. How many candles can I make with 500 grams of wax and 3 wicks? Be concise.
'
[1722170110] input tokens: [ 'Making':43346, ' one':832, ' candle':38899, ' requires':7612, ' ':220, '125':6549, ' grams':34419, ' of':315, ' wax':37123, ' and':323, ' ':220, '1':16, ' w':289, 'ick':875, '.':13, ' How':2650, ' many':1690, ' candles':52305, ' can':649, ' I':358, ' make':1304, ' with':449, ' ':220, '500':2636, ' grams':34419, ' of':315, ' wax':37123, ' and':323, ' ':220, '3':18, ' w':289, 'icks':5908, '?':30, ' Be':2893, ' concise':64694, '.':627 ]
[1722170110] n_remain: -37
[1722170110] embd_inp.size(): 46, n_consumed: 1
[1722170110] eval: [ '<|start_header_id|>':128006, 'user':882, '<|end_header_id|>':128007, '':271, 'Making':43346, ' one':832, ' candle':38899, ' requires':7612, ' ':220, '125':6549, ' grams':34419, ' of':315, ' wax':37123, ' and':323, ' ':220, '1':16, ' w':289, 'ick':875, '.':13, ' How':2650, ' many':1690, ' candles':52305, ' can':649, ' I':358, ' make':1304, ' with':449, ' ':220, '500':2636, ' grams':34419, ' of':315, ' wax':37123, ' and':323, ' ':220, '3':18, ' w':289, 'icks':5908, '?':30, ' Be':2893, ' concise':64694, '.':627, '<|eot_id|>':128009, '<|start_header_id|>':128006, 'assistant':78191, '<|end_header_id|>':128007, '':271 ]

./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --no-mmap -fa -c 32768 --temp 0 -if --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|start_header_id|>assistant<|end_header_id|>\n\n"

[1722169869] == Running in interactive mode. ==
[1722169869]  - Press Ctrl+C to interject at any time.
[1722169869]  - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

[1722169869] embd_inp.size(): 1, n_consumed: 0
[1722169869] eval: [ '<|begin_of_text|>':128000 ]
[1722169869] n_past = 1
[1722169869] embd_inp.size(): 1, n_consumed: 1
[1722169869] waiting for user input
[1722169869] appending input prefix: '<|start_header_id|>user<|end_header_id|>

'
[1722169874] appending input suffix: '<|start_header_id|>assistant<|end_header_id|>

'
[1722169874] buffer: 'Making one candle requires 125 grams of wax and 1 wick. How many candles can I make with 500 grams of wax and 3 wicks? Be concise.
'
[1722169874] input tokens: [ 'Making':43346, ' one':832, ' candle':38899, ' requires':7612, ' ':220, '125':6549, ' grams':34419, ' of':315, ' wax':37123, ' and':323, ' ':220, '1':16, ' w':289, 'ick':875, '.':13, ' How':2650, ' many':1690, ' candles':52305, ' can':649, ' I':358, ' make':1304, ' with':449, ' ':220, '500':2636, ' grams':34419, ' of':315, ' wax':37123, ' and':323, ' ':220, '3':18, ' w':289, 'icks':5908, '?':30, ' Be':2893, ' concise':64694, '.':627 ]
[1722169874] n_remain: -37
[1722169874] embd_inp.size(): 45, n_consumed: 1
[1722169874] eval: [ '<|start_header_id|>':128006, 'user':882, '<|end_header_id|>':128007, '':271, 'Making':43346, ' one':832, ' candle':38899, ' requires':7612, ' ':220, '125':6549, ' grams':34419, ' of':315, ' wax':37123, ' and':323, ' ':220, '1':16, ' w':289, 'ick':875, '.':13, ' How':2650, ' many':1690, ' candles':52305, ' can':649, ' I':358, ' make':1304, ' with':449, ' ':220, '500':2636, ' grams':34419, ' of':315, ' wax':37123, ' and':323, ' ':220, '3':18, ' w':289, 'icks':5908, '?':30, ' Be':2893, ' concise':64694, '.':627, '<|start_header_id|>':128006, 'assistant':78191, '<|end_header_id|>':128007, '':271 ]
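
For comparison, here is a hedged sketch of how to print the reference Llama 3.1 chat-template tokenization with Hugging Face transformers (assuming the transformers package is installed and the meta-llama repo is accessible); as far as I can tell, the reference template ends the user turn with <|eot_id|> before the assistant header, i.e. it matches the --in-suffix "<|eot_id|>..." run above:

# Hedged sketch: print the reference chat-template token sequence for this prompt.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
ids = tok.apply_chat_template(
    [{"role": "user", "content": "Making one candle requires 125 grams of wax and 1 wick. "
      "How many candles can I make with 500 grams of wax and 3 wicks? Be concise."}],
    add_generation_prompt=True,
)
print(tok.convert_ids_to_tokens(ids))
# Expect the user turn to be closed by '<|eot_id|>' before the
# '<|start_header_id|>', 'assistant', '<|end_header_id|>' tokens.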
Azirine commented 1 month ago

It's possible that 3.1 might have some regressions compared to 3.0.

steampunque showed above that Llama 3.1 can answer correctly when converted without the RoPE patch, so this is not caused by a regression in 3.1 itself.

steampunque commented 1 month ago

IQ4_XS is okay with the new version:

bash-5.1$ Q=IQ4_XS lm candle.txt 
You can make 4 candles with 500 grams of wax, but you only have 3 wicks, so you can only make 3 candles.

It could be that this prompt is right on the edge at an early key token position (two tokens with very close probability) and just happens to go the wrong way on the new version with this particular prompt. Here are a run with no temp set and a temp 0.3 run using Q6_K on the new version:

bash-5.1$ lm candle.txt 
To find the number of candles, divide the total wax (500g) by the wax per candle (125g). Then, divide the result by the number of wicks (3) since each candle requires 1 wick.

500g / 125g = 4 candles
4 candles / 3 wicks = 1.33 candles (round down to 1, since you can't make a fraction of a candle)

You can make 1 candle with 500g of wax and 3 wicks.

TEMP:

bash-5.1$ TEMP=0.3 lm candle.txt 
To make one candle, you need 125g of wax and 1 wick. You have 500g of wax and 3 wicks. 

You can make 500g / 125g = 4 candles with the wax. However, you only have 3 wicks, so you can only make 3 candles.
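
A toy illustration of that "edge" hypothesis (all numbers hypothetical): when two candidate logits are nearly tied, a tiny perturbation such as quantization noise can flip the greedy pick, while at temp 0.3 both tokens keep real probability mass:

import math, random

def softmax(logits, temp=1.0):
    exps = [math.exp(l / temp) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical, nearly tied candidates for the first continuation token.
logits = {" This": 10.02, " Then": 10.00}
noisy = {t: l + random.gauss(0, 0.05) for t, l in logits.items()}  # e.g. quantization noise

print(max(logits, key=logits.get), max(noisy, key=noisy.get))  # greedy pick may flip
print(softmax(list(logits.values()), temp=0.3))                # both keep real mass at temp 0.3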
Azirine commented 1 month ago

Llama 3.1's answer starts to go wrong from here: "Then, divide the result by the number of wicks..."

So I kept the beginning part, "To find the number of candles, divide the total wax (500g) by the wax per candle (125g).", then asked both Llama 3 and 3.1 8B to complete the answer, looking at the token probabilities of both (temp 0.3):

Llama 3 (correct)

[( This 98.67%) ( Then 1.13%) ( You 0.12%) (  0.08%)]
[( gives 99.98%) ( leaves 0.02%) ( yields 0.00%) ( is 0.00%)]
[( you 97.83%) (  2.14%) (:\n\n 0.03%) ( us 0.00%)]
[(  99.88%) (:\n\n 0.09%) ( the 0.03%) (: 0.00%)]
[(4 100.00%) (500 0.00%) (400 0.00%) (2 0.00%)]
[( candles 99.99%) ( complete 0.01%) ( whole 0.00%) ( full 0.00%)]
[(. 97.13%) (.\n\n 2.86%) ( with 0.01%) ( ( 0.00%)]
[( Since 99.85%) ( However 0.08%) ( Then 0.03%) ( You 0.02%)]
[( you 99.94%) ( each 0.04%) (  0.02%) ( there 0.00%)]
[( have 99.97%) ( also 0.03%) ( already 0.00%) ( only 0.00%)]
[(  100.00%) ( an 0.00%) ( extra 0.00%) ( only 0.00%)]
[(3 100.00%) (1 0.00%) (2 0.00%) (4 0.00%)]
[( w 99.99%) ( extra 0.01%) ( additional 0.00%) ( spare 0.00%)]
[(icks 100.00%) (cks 0.00%) (ickets 0.00%) (ick 0.00%)]
[(, 100.00%) ( ( 0.00%) ( instead 0.00%) ( and 0.00%)]
[( you 100.00%) ( which 0.00%) ( not 0.00%) ( the 0.00%)]
[( can 100.00%) ( cannot 0.00%) ( have 0.00%) ('ll 0.00%)]
[( make 78.70%) ( only 21.30%) ('t 0.00%) ( still 0.00%)]
[(  100.00%) ( at 0.00%) ( a 0.00%) ( only 0.00%)]
[(3 100.00%) (4 0.00%) (2 0.00%) (1 0.00%)]
[( candles 100.00%) ( of 0.00%) (/ 0.00%) ( out 0.00%)]
[(. 99.84%) ( ( 0.11%) (, 0.03%) ( with 0.02%)]
[(<|eot_id|> 100.00%) ( The 0.00%) ( ( 0.00%) ( Answer 0.00%)]

Llama 3.1 (incorrect)

[( Then 39.11%) ( \n\n 25.09%) ( This 24.65%) ( Since 9.67%)]
[(, 98.47%) ( divide 1.52%) ( consider 0.01%) ( multiply 0.00%)]
[( divide 96.80%) ( consider 3.04%) ( check 0.06%) ( since 0.06%)]
[( the 93.94%) ( by 4.92%) ( that 1.10%) ( this 0.04%)]
[( result 82.64%) ( total 11.28%) ( number 6.08%) ( w 0.00%)]
[( by 100.00%) ( ( 0.00%) ( of 0.00%) ( into 0.00%)]
[( the 99.99%) (  0.01%) ( number 0.00%) ( ( 0.00%)]
[( number 99.96%) ( total 0.04%) ( ratio 0.00%) ( w 0.00%)]
[( of 100.00%) ( w 0.00%) ( available 0.00%) ( that 0.00%)]
[( w 100.00%) ( available 0.00%) ( candles 0.00%) ( usable 0.00%)]
[(icks 100.00%) (ick 0.00%) (oks 0.00%) (cks 0.00%)]
[( ( 99.81%) ( available 0.11%) ( you 0.07%) ( per 0.01%)]
[(3 100.00%) (1 0.00%) (since 0.00%) (you 0.00%)]
[() 51.52%) (). 19.87%) ().\n\n 15.59%) (), 13.02%)]
[( to 62.59%) ( since 35.48%) ( because 1.14%) ( and 0.79%)]
[( account 99.85%) ( find 0.13%) ( ensure 0.01%) ( avoid 0.00%)]
[( for 100.00%) ( that 0.00%) ( only 0.00%) ( the 0.00%)]
[( the 99.99%) ( each 0.01%) ( having 0.00%) ( w 0.00%)]
[( w 70.95%) ( limitation 16.01%) ( fact 10.62%) ( limited 1.34%)]
[(ick 94.26%) (icks 5.74%) (icking 0.00%) (ix 0.00%)]
[( limitation 99.99%) ( requirement 0.01%) ( limit 0.00%) ( shortage 0.00%)]
[(.\n\n 96.72%) (. 3.27%) (:\n\n 0.01%) (: 0.00%)]
[(500 99.92%) (C 0.04%) (( 0.02%) (Number 0.01%)]
[(g 100.00%) ( g 0.00%) ( / 0.00%) ( ? 0.00%)]
[( / 83.44%) ( ? 16.55%) ( ( 0.00%) ( wax 0.00%)]
[(  100.00%) ( ( 0.00%) (125 0.00%) (  0.00%)]
[(125 100.00%) (250 0.00%) (500 0.00%) (3 0.00%)]
[(g 100.00%) ( = 0.00%) ( g 0.00%) ( grams 0.00%)]
[( = 99.39%) ( per 0.59%) (/c 0.01%) ( ( 0.00%)]
[(  100.00%) (4 0.00%) ( x 0.00%) ( ( 0.00%)]
[(4 100.00%) (  0.00%) (5 0.00%) (400 0.00%)]
[( candles 96.94%) (\n 3.05%) ( ( 0.01%) (\n\n 0.00%)]
[(\n 99.88%) ( ( 0.12%) (\n\n 0.00%) (. 0.00%)]
[(4 100.00%) (Since 0.00%) (However 0.00%) (  0.00%)]
[( candles 99.89%) ( / 0.11%) ( candies 0.00%) (/ 0.00%)]
[( / 100.00%) ( ? 0.00%) ( * 0.00%) ( with 0.00%)]
[(  100.00%) ( ( 0.00%) ( w 0.00%) (3 0.00%)]
[(3 100.00%) (1 0.00%) (2 0.00%) (4 0.00%)]
[( w 99.64%) ( = 0.35%) ( ( 0.00%) ( is 0.00%)]
[(icks 100.00%) (cks 0.00%) (ick 0.00%) (inks 0.00%)]
[( = 100.00%) ( per 0.00%) ( ≈ 0.00%) (/c 0.00%)]
[(  99.99%) ( approximately 0.01%) ( not 0.00%) ( ( 0.00%)]
[(1 100.00%) (4 0.00%) (2 0.00%) (3 0.00%)]
[(. 100.00%) ( candle 0.00%) ( full 0.00%) ( ( 0.00%)]
[(33 100.00%) (333 0.00%) (3 0.00%) (32 0.00%)]
[( candles 71.25%) ( ( 28.48%) (, 0.27%) (\n\n 0.00%)]
[( ( 98.64%) (\n\n 1.34%) (, 0.01%) ( per 0.00%)]
[(round 99.97%) (you 0.02%) (but 0.01%) (approximately 0.00%)]
[( down 100.00%) ( to 0.00%) ( up 0.00%) (ing 0.00%)]
[( to 100.00%) ( since 0.00%) (, 0.00%) ( because 0.00%)]
[(  100.00%) ( whole 0.00%) ( the 0.00%) ( make 0.00%)]
[(1 100.00%) (0 0.00%) (4 0.00%) (2 0.00%)]
[(, 80.74%) ( candle 19.13%) ( since 0.13%) ()\n\n 0.00%)]
[( as 72.41%) ( since 27.59%) ( assuming 0.00%) ( but 0.00%)]
[( you 99.99%) ( partial 0.01%) ( a 0.00%) (  0.00%)]
[( can 99.99%) ( cannot 0.01%) ( need 0.00%) ( only 0.00%)]
[('t 100.00%) ( only 0.00%) ( make 0.00%) ( still 0.00%)]
[( make 100.00%) ( have 0.00%) ( partially 0.00%) ( split 0.00%)]
[( a 100.00%) (  0.00%) ( part 0.00%) ( fractions 0.00%)]
[( fraction 100.00%) ( partial 0.00%) ( candle 0.00%) ( fractional 0.00%)]
[( of 100.00%) ( or 0.00%) ( candle 0.00%) ()\n\n 0.00%)]
[( a 100.00%) ( candle 0.00%) ( an 0.00%) ( candles 0.00%)]
[( candle 100.00%) ( complete 0.00%) ( cake 0.00%) ( wax 0.00%)]
[()\n\n 97.38%) () 2.46%) ( with 0.16%) (, 0.00%)]
[(You 97.59%) (Answer 2.37%) (So 0.04%) (With 0.00%)]
[( can 100.00%) ( have 0.00%) ('ll 0.00%) ( cannot 0.00%)]
[( make 100.00%) ( only 0.00%) ('t 0.00%) ( actually 0.00%)]
[(  99.99%) ( approximately 0.01%) ( at 0.00%) ( ** 0.00%)]
[(1 100.00%) (4 0.00%) (2 0.00%) (3 0.00%)]
[( candle 100.00%) ( complete 0.00%) ( full 0.00%) (- 0.00%)]
[( with 73.85%) (. 26.15%) (, 0.00%) ( or 0.00%)]
[(  81.13%) ( the 18.87%) ( this 0.00%) ( these 0.00%)]
[(500 99.98%) (3 0.02%) (1 0.00%) (2 0.00%)]
[( grams 89.26%) (g 10.74%) (grams 0.00%) ( g 0.00%)]
[( of 100.00%) ( and 0.00%) (, 0.00%) (. 0.00%)]
[( wax 100.00%) (  0.00%) ( w 0.00%) ( the 0.00%)]
[( and 100.00%) (. 0.00%) (, 0.00%) ( using 0.00%)]
[(  100.00%) ( the 0.00%) ( have 0.00%) ( a 0.00%)]
[(3 100.00%) (1 0.00%) (2 0.00%) (4 0.00%)]
[( w 100.00%) ( available 0.00%) ( sticks 0.00%) ( ( 0.00%)]
[(icks 100.00%) (ickets 0.00%) (ick 0.00%) (ickers 0.00%)]
[(. 100.00%) (, 0.00%) ( ( 0.00%) (<|eot_id|> 0.00%)]
[(<|eot_id|> 100.00%) ( The 0.00%) ( You 0.00%) ( ( 0.00%)]

The key is that from here, "To find the number of candles, divide the total wax (500g) by the wax per candle (125g). Then,"

Llama 3.1 says we should divide again, which is wrong. [( divide 96.80%) ( consider 3.04%) ( check 0.06%) ( since 0.06%)]

Let's see how Llama 3 completes this: [( subtract 100.00%) ( divide 0.00%) ( adjust 0.00%) ( since 0.00%)]

As we can see, this is not a case of "two tokens with very close probability".
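
For anyone who wants to reproduce this kind of per-token probe, a hedged sketch with llama-cpp-python (the model path and the prefilled assistant text are illustrative; create_completion's logprobs option returns the top alternatives for each generated token):

# Hedged sketch: inspect top-4 alternatives per generated token with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf", n_ctx=8192)
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Making one candle requires 125 grams of wax and 1 wick. "
    "How many candles can I make with 500 grams of wax and 3 wicks? Be concise."
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    "To find the number of candles, divide the total wax (500g) by the wax per candle (125g). Then,"
)
out = llm.create_completion(prompt, max_tokens=8, temperature=0.0, logprobs=4)
lp = out["choices"][0]["logprobs"]
for tok, top in zip(lp["tokens"], lp["top_logprobs"]):
    print(repr(tok), top)   # chosen token plus its top-4 alternatives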

steampunque commented 1 month ago

I'm not so sure. I think this prompt is on the edge somehow. I have an experimental beam search function in my server and testing it out shows inconsistent behavior tracking different beams using Q6_K:

2 BEAM: bad

bash-5.1$ BEAMS=2 lm candle.txt 
You can make 4 candles with 500 grams of wax (500 / 125 = 4) and you have 1 wick left over.
LOGPROB -10.587428092956543
You can make 4 candles with 500 grams of wax (500 / 125 = 4) and you'll have 1 wick left over.
LOGPROB -10.953845024108887

3 BEAM: OK

bash-5.1$ BEAMS=3 lm candle.txt 
To make one candle, you need 125g of wax and 1 wick. You have 500g of wax and 3 wicks. 

Divide 500g by 125g to get the number of candles you can make with the wax: 500g / 125g = 4 candles.

Since you have 3 wicks, you can only make 3 candles.
LOGPROB -19.450170516967773
To make one candle, you need 125g of wax and 1 wick. You have 500g of wax and 3 wicks. 

Divide 500g by 125g to get the number of candles you can make with the wax: 500g / 125g = 4 candles.

Since you have 3 wicks, you can only make 3 candles because each candle requires 1 wick.
LOGPROB -20.716386795043945
To make one candle, you need 125g of wax and 1 wick. You have 500g of wax and 3 wicks. 

Divide 500g by 125g to get the number of candles you can make with the wax: 500g / 125g = 4 candles.

You have 3 wicks, which is not enough to make 4 candles. So, you can make 3 candles.
LOGPROB -21.438295364379883

Very tiny differences in the generation probabilities cause the beam results to diverge quite a bit on this prompt. I'm not 100% sure about my beam search code, but I don't see anything overtly bad in the results, just different conclusions.
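
To make the divergence concrete, here is a toy beam search over a mock scorer (not the actual server code), showing how hypotheses with nearly identical cumulative log-probabilities can end at different conclusions:

# Toy beam search over a mock next-token scorer (illustrative only).
import math
from heapq import nlargest

def beam_search(score_fn, beams, steps):
    hyps = [(0.0, [])]                          # (cumulative logprob, tokens)
    for _ in range(steps):
        expanded = [(lp + math.log(p), toks + [tok])
                    for lp, toks in hyps
                    for tok, p in score_fn(toks)]
        hyps = nlargest(beams, expanded, key=lambda h: h[0])
    return hyps

# Mock scorer: "3 candles" and "4 candles" stay almost equally likely each step,
# so beams with nearly identical scores reach different final answers.
def mock_scores(_prefix):
    return [("3 candles", 0.51), ("4 candles", 0.49)]

for lp, toks in beam_search(mock_scores, beams=3, steps=2):
    print(round(lp, 3), toks)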

I also have a function where I can lead the assistant response, and if I do that, all is OK:

bash-5.1$ START=1 lm @candle.txt "Considering both wax and wicks as limiting factors,"
 we can calculate the number of candles as follows:

Wax: 500g / 125g per candle = 4 candles
Wicks: 3 wicks / 1 wick per candle = 3 candles

Since the wicks are the limiting factor, we can only make 3 candles.

I'm rerunning some benches, since I logged my benches on the non-RoPE version. Initial results seem OK on the new version; I'll post a summary here, but as of now I see nothing overtly wrong. It seems to me like an edge-case prompt. These models are not actually intelligent; they just follow the dots of both how they were trained and how the cumulative noise plays out in the calcs at inference time.

steampunque commented 1 month ago

Benches look OK with the RoPE patch. One notable regression on code, but not huge. No glaring performance degradation. Also, no guarantee it does not have a problem; it's hard to be 100% conclusive, but I'm pretty sure the tested prompt is an edge case for the model. Relative to Gemma 2 9B, I have found the model is not strong on reasoning and can get thrown off or flat-out hallucinate quite easily.

| model | Meta-Llama-3.1-8B-Instruct (no RoPE patch) | Meta-Llama-3.1-8B-Instruct (RoPE patch) | gemma-2-9b-it (SOTA) |
| --- | --- | --- | --- |
| quant | Q6_K | Q6_K | Q6_K |
| TEST | - | - | - |
| WG | 0.737 | 0.741 | 0.762 |
| LAMBADA | 0.705 | 0.705 | 0.735 |
| HELLASWAG | 0.694 | 0.696 | 0.775 |
| TQA1 | 0.556 | 0.564 | 0.701 |
| TQA2 | 0.510 | 0.512 | 0.692 |
| BOOLQ | 0.612 | 0.610 | 0.687 |
| ARCC | 0.776 | 0.776 | 0.882 |
| ARCE | 0.905 | 0.906 | 0.952 |
| RACEM | 0.725 | 0.734 | 0.849 |
| RACEH | 0.678 | 0.679 | 0.802 |
| CSQA | 0.683 | 0.686 | 0.751 |
| OBQA | 0.765 | 0.765 | 0.846 |
| COPA | 0.887 | 0.889 | 0.925 |
| PIQA | 0.723 | 0.725 | 0.801 |
| SIQA | 0.647 | 0.648 | 0.693 |
| JEOPARDY | 0.540 | 0.510 | 0.550 |
| GSM8K 0-shot CoT | 0.870 | 0.872 | 0.890 |
| HUMANEVAL | 0.664 | 0.652 | 0.658 |
| COMPOSITES | - | - | - |
| REASONING | 0.737 | 0.738 | 0.806 |
| UNDERSTANDING | 0.692 | 0.695 | 0.816 |
| LANGUAGE | 0.705 | 0.705 | 0.735 |
| KNOWLEDGE | 0.561 | 0.561 | 0.696 |
| COT | 0.870 | 0.872 | 0.890 |
| CODE | 0.664 | 0.652 | 0.658 |

Gemma 2 9B is bad on this prompt too:

bash-5.1$ lm candle.txt
You can make **4** candles. 

The problem is the "Be concise" instruction. If you let it think by removing "Be concise":

bash-5.1$ lm Making one candle requires 125 grams of wax and 1 wick. How many candles can I make with 500 grams of wax and 3 wicks?
Here's how to solve this:

* **Candles per wax amount:** You can make 500 grams of wax / 125 grams per candle = 4 candles.
* **Candles per wick amount:** You can make 3 wicks / 1 wick per candle = 3 candles.

* **Limiting factor:** Since you can only make 3 candles with the available wicks, that's the maximum number of candles you can make. 

**Answer:** You can make 3 candles. 

Now do the same with Llama 3.1:

bash-5.1$ lm Making one candle requires 125 grams of wax and 1 wick. How many candles can I make with 500 grams of wax and 3 wicks?
## Step 1: Calculate the total amount of wax available
We have 500 grams of wax available.

## Step 2: Calculate the total number of wicks available
We have 3 wicks available.

## Step 3: Determine the limiting factor
Since each candle requires 1 wick, the number of wicks available (3) is the limiting factor. We cannot make more candles than we have wicks for.

## Step 4: Calculate the number of candles that can be made with the available wicks
Since each candle requires 1 wick, we can make 3 candles with the available wicks.

## Step 5: Check if the wax is sufficient for the number of candles that can be made
Each candle requires 125 grams of wax. With 3 wicks, we can make 3 candles. The total wax required for 3 candles is 3 * 125 = 375 grams. Since we have 500 grams of wax available, which is more than the 375 grams required, the wax is sufficient.

## Step 6: Determine the final number of candles that can be made
Since the wax is sufficient and the number of wicks is the limiting factor, we can make 3 candles.

The final answer is: $\boxed{3}$

The "be concise" turns this into an adversarial prompt for the models.

qnixsynapse commented 1 month ago

There are some serious issues with this model. Even Llama 3.0 8B was not this bad.

image

image

The reasoning is so horrible when I asked why the others do not have a birthday this year. I am currently using @bartowski1182's quant... I will retry quantizing it myself to see if this persists.

bartowski1182 commented 1 month ago

Where are you running it?

qnixsynapse commented 1 month ago

On my PC, of course.

Here is the reasoning of my quant, using my own imatrix dataset 🥲: image

Gemma 9B gives a much better response though!

Vaibhavs10 commented 1 month ago

In addition to the above, with the most recent RoPE patch it OOMs on my 24 GB MBP.

Freshly quantised model with: https://huggingface.co/spaces/ggml-org/gguf-my-repo, model repo here: https://huggingface.co/reach-vb/Meta-Llama-3.1-8B-Instruct-Q6_K-GGUF

I get the same results with @bartowski1182's and lm-studio's quants too.

EDIT: I messed up, wasn't passing the ctx-size 🤗

Stacktrace:

llama-cli --hf-repo reach-vb/Meta-Llama-3.1-8B-Instruct-Q6_K-GGUF --hf-file meta-llama-3.1-8b-instruct-q6_k.gguf -p "The meaning to life and the universe is"
Log start
main: build = 3485 (6eeaeba1)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
main: seed  = 1722265129
llama_download_file: no previous model file found /Users/vb/Library/Caches/llama.cpp/meta-llama-3.1-8b-instruct-q6_k.gguf
llama_download_file: downloading from https://huggingface.co/reach-vb/Meta-Llama-3.1-8B-Instruct-Q6_K-GGUF/resolve/main/meta-llama-3.1-8b-instruct-q6_k.gguf to /Users/vb/Library/Caches/llama.cpp/meta-llama-3.1-8b-instruct-q6_k.gguf (server_etag:"fe7a41737a17475cd648b7d276b0d8ee-413", server_last_modified:Mon, 29 Jul 2024 14:55:54 GMT)...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1175  100  1175    0     0  11385      0 --:--:-- --:--:-- --:--:-- 11385
100 6290M  100 6290M    0     0  29.4M      0  0:03:33  0:03:33 --:--:-- 8268k
llama_download_file: file metadata saved: /Users/vb/Library/Caches/llama.cpp/meta-llama-3.1-8b-instruct-q6_k.gguf.json
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /Users/vb/Library/Caches/llama.cpp/meta-llama-3.1-8b-instruct-q6_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 18
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q6_K:  226 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 6.14 GiB (6.56 BPW)
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =  6282.98 MiB, ( 6283.05 / 16384.02)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   410.98 MiB
llm_load_tensors:      Metal buffer size =  6282.97 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 17179.89 MB
llama_kv_cache_init:      Metal KV buffer size = 16384.00 MiB
llama_new_context_with_model: KV self size  = 16384.00 MiB, K (f16): 8192.00 MiB, V (f16): 8192.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      Metal compute buffer size =  8480.00 MiB
llama_new_context_with_model:        CPU compute buffer size =   264.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)

system_info: n_threads = 4 / 8 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 131072, n_batch = 2048, n_predict = -1, n_keep = 1

The meaning to life and the universe isggml_metal_graph_compute: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
%

llama_print_timings:        load time =   14394.80 ms
llama_print_timings:      sample time =       0.09 ms /     1 runs   (    0.09 ms per token, 11235.96 tokens per second)
llama_print_timings: prompt eval time =    3604.44 ms /     9 tokens (  400.49 ms per token,     2.50 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    5437.14 ms /    10 tokens
^C^C%
bartowski1182 commented 1 month ago

@qnixsynapse I meant with what tool, is it fully updated?

@Vaibhavs10 I think that's expected if you don't manually limit n_ctx, because it'll try to allocate enough memory to hold the entire context ahead of time, which will be an absolutely absurd amount of memory.

qnixsynapse commented 1 month ago

Yeah. It is fully updated.

@Vaibhavs10 The KV cache size at full context is about 16 GB. I clip the context to 10k only.
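
For reference, the ~16 GB follows directly from the model's GQA dimensions at the full 131072-token context (it matches the "KV self size = 16384.00 MiB" line in the log above); a quick sanity check:

# Back-of-the-envelope KV cache size for Llama 3.1 8B at full context, f16 cache.
n_layer, n_ctx = 32, 131072
n_embd_kv      = 1024          # n_embd_k_gqa = n_embd_v_gqa (8 KV heads x 128)
bytes_per_elem = 2             # f16

kv_bytes = n_layer * n_ctx * 2 * n_embd_kv * bytes_per_elem   # K and V
print(kv_bytes / 2**20, "MiB")                                # 16384.0 MiB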

Vaibhavs10 commented 1 month ago

Ah makes sense thanks @bartowski1182 & @qnixsynapse 🤗

Not sure how I missed clipping the context. Using ctx-size 8096 works wonders!

Dampfinchen commented 1 month ago

There are some serious issues with this model. Even Llama 3.0 8B was not this bad.

image

image

The reasoning is so horrible when I asked why the others do not have a birthday this year. I am currently using @bartowski1182's quant... I will retry quantizing it myself to see if this persists.

This does not have anything to do with llama.cpp or the quants. The model at FP16 on LMSys gives a very similar response.

Unfortunately, compared to Gemma 2, L3 8B is quite dumb.

Edit: Seems like all LLMs have issues with that question. Only Claude Sonnet added a little remark: "Several celebrities who starred in the Avengers movies have birthdays every year, as do all living people."