ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: Proper Llama 3.1 Support in llama.cpp #8650

Closed Vaibhavs10 closed 1 month ago

Vaibhavs10 commented 1 month ago

Prerequisites

Feature Description

Llama 3.1 was just released and it is a significant leg up from the previous series of models: https://huggingface.co/blog/llama31

Whilst the overall architecture is the same, it requires some modelling updates, primarily around RoPE scaling: https://github.com/huggingface/transformers/blob/bc2adb0112b6677b0dfb4105c74570a0f92183eb/src/transformers/modeling_rope_utils.py#L298

It'd be great to add support for those so that the generations are more coherent and make sense.

Motivation

Note: Without the modelling changes, the generations might look coherent, but they are far from great and don't reach the true potential of the model!

Possible Implementation

Here's the corresponding transformers implementation: https://github.com/huggingface/transformers/blob/bc2adb0112b6677b0dfb4105c74570a0f92183eb/src/transformers/modeling_rope_utils.py#L298
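For reference, the gist of that scaling, as a rough Python sketch of the transformers logic rather than the exact code (the constants are the ones Llama 3.1 ships in its config: rope factor 8, low/high frequency factors 1 and 4, original context length 8192):

import math

def llama31_scale_inv_freq(inv_freq, factor=8.0, low_freq_factor=1.0,
                           high_freq_factor=4.0, old_context_len=8192):
    # Rescale RoPE inverse frequencies the way Llama 3.1 does (sketch).
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    out = []
    for freq in inv_freq:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            out.append(freq)                      # high-frequency dims stay untouched
        elif wavelen > low_freq_wavelen:
            out.append(freq / factor)             # low-frequency dims are scaled down
        else:
            # smooth interpolation between the two regimes
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            out.append((1 - smooth) * freq / factor + smooth * freq)
    return out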

RodriMora commented 1 month ago

Some new tests for the question:

If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight?

llama.cpp ed67bcb, temp 0.0, top_p 1, no system prompt:

~/llama.cpp/llama-server -m ~/models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 8000 -ngl 99 --host 0.0.0.0 --port 5000

With my own quant of 8B Q8_0 (done after the tokenizer fix).

correct answer:

[...]= 60.5 kg + 3.025 kg ≈ 63.525 kg So, if you gained 5% of your current weight, you would weigh approximately 63.5 kg.
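As a sanity check of the arithmetic (weight = BMI × height², then +5%; the model's 63.5 kg comes from rounding the starting weight down to 60.5 kg):

bmi, height_m = 20.5, 1.72
weight = bmi * height_m ** 2          # ≈ 60.65 kg starting weight
print(round(weight * 1.05, 2))        # ≈ 63.68 kg after gaining 5%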

With -fa -ctk q4_0 -ctv q4_0, same temp 0.0 and top_p 1, no system prompt:

~/llama.cpp/llama-server -m ~/models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 8000 -ngl 99 --host 0.0.0.0 --port 5000 -fa -ctk q4_0 -ctv q4_0

Incorrect answer:

= 54.4 kg + 2.72 kg ≈ 57.12 kg So, if you gained 5% of your current weight, you would weigh approximately 57.12 kg.

Control with Groq and vLLM: Groq with temp 0.0 and top_p 1, and vLLM locally using the safetensors model, both give the correct answer, like the Q8_0 without FA.

(screenshot)

Tested @tristandruyen's Q8 model with the RoPE PR but got an error; downloaded it twice and the sha256 checks out:

(screenshot)

TL;DR: The GGUF quants work as expected at 8K context with the tokenizer fix, as long as FA isn't used.

tristandruyen commented 1 month ago

Tested @tristandruyen's Q8 model with the RoPE PR but got an error; downloaded it twice and the sha256 checks out: (screenshot)

That's weird. AFAIK this is a known error when using models made with the PR on an old llama.cpp version; it will probably be fixed soon...

Are you sure you recompiled?

What command did you use to cause the error?

RodriMora commented 1 month ago

Tested @tristandruyen's Q8 model with the RoPE PR but got an error; downloaded it twice and the sha256 checks out: (screenshot)

That's weird. AFAIK this is a known error when using models made with the PR on an old llama.cpp version. Are you sure you recompiled?

What command did you use to cause the error?

Yes. Switched to the PR, recompiled, then ran llama-server:

gh pr checkout 8676
make GGML_CUDA=1
~/llama.cpp/llama-server -m ~/models/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf -c 128000 --host 0.0.0.0 --port 5000

Edit: Recompiled again and it worked

ubergarm commented 1 month ago

ok ..wait for a new gguf to test ;)

It's here: https://huggingface.co/qwp4w3hyb/Meta-Llama-3.1-8B-Instruct-iMat-GGUF/resolve/main/meta-llama-3.1-8b-instruct-imat-Q8_0.gguf?download=true

Thanks @tristandruyen - just tried your qwp4w3hyb/Meta-Llama-3.1-8B-Instruct-iMat-GGUF with this PR and it is the first one I've tried that didn't output gibberish with my long context input!

Notes and local generation speed here. Cheers and thanks all for your efforts!

RodriMora commented 1 month ago

Good news.

Tested long context with @tristandruyen's Q8 iMat model, as well as my own Q8_0 quant without an imatrix, using this PR https://github.com/ggerganov/llama.cpp/pull/8676 - and long context works.

Control group using https://console.groq.com/playground

Copy/pasted this paper and asked to summarize it

Temp 0.0, top_p 1

(screenshot)

Tested with https://github.com/ggerganov/llama.cpp/pull/8676, used both for inference and for generating the Q8 model. Temp 0.0 and top_p 1:

./llama-server -m ~/models/Meta-Llama-3.1-8B-Instruct-q8_0rope.gguf -c 128000 -ngl 99 --host 0.0.0.0 --port 5000

Same result as control:

(screenshot)

Note: The RoPE PR breaks previously generated GGUFs, so that still needs to be fixed. Note 2: I think it also fixes FA, which didn't work before: https://github.com/ggerganov/llama.cpp/issues/8650#issuecomment-2250605036

mirek190 commented 1 month ago

@RodriMora

RoPE is working with the fixes, and @tristandruyen's GGUF is fine.

I pasted 110k tokens of text into llama-cli with 3.1 8B Q8 (that's a lot of text 0_0) and asked it to fix the paragraphs, as the text wasn't formatted at all... it did it in 20 minutes... generated an insane amount of text with nice paragraphs 0_0 and it's correct. I checked.

Nottlespike commented 1 month ago

Spotted a gguf on X that I thought should be interesting for comparison GGUF-FIXED

As far as I understand this only changes the tokenizer type from smaug-bpe to llama-bpe, which is already fixed in newer GGUFs, as it was due to an upstream error in the huggingface repo. This also likely does not include the fixes from #8676.

Also I can't run this as it's 405B 😭

I uh made this and it works at 110k ctx

mirek190 commented 1 month ago

Tested more... temp 0.6 (with temp 0, answers are correct 10/10).
groq.com also uses a similar temp of 0.6.

"If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight?"

"I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?"

"Making one candle requires 125 grams of wax and 1 wick. How many candles can I make with 500 grams of wax and 3 wicks? Be concise."

Currently: groq.com is ALWAYS correct - 10/10. Locally it's around 8-9/10 correct.

ghchris2021 commented 1 month ago

@mirek190

Tested more... temp 0.6 (with temp 0, answers are correct 10/10). groq.com also uses a similar temp of 0.6.

Currently: groq.com is ALWAYS correct - 10/10. Locally it's around 8-9/10 correct.

It's probably stating the obvious, but if temp 0 == fine, temp 0.6 == 90% fine, and groq is always fine, then I'd look at what is non-deterministic about the local runs. You'd expect the randomization to be determined by the seed (unless I'm mixing this up with something else entirely), so maybe make a list of a few seed / PRNG values that work and a few that don't, then repeat the query with those exact RNG values and check whether the output is 100% deterministic and repeats correct / incorrect as one would expect.

The other category of variables could have to do with whatever else is in the context prior to the turn in question (unless you start with a deterministic pre-context for every run), so eliminate that if it remains a variable.

Otherwise compare the groq settings, see whether you can observe or control the RNG / seed there, and check whether there is any variable you can correlate at all (maybe not).

Isn't there a way to view the token probabilities being sampled from to determine the output? You could look for the point where the paths diverge, i.e. where it picks a nearly equiprobable option or something rare.
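For example, something along these lines against a local llama-server would exercise exactly that (a sketch only; whether the OpenAI-compatible endpoint honours the seed and logprobs fields depends on the llama.cpp version):

import requests

URL = "http://localhost:5000/v1/chat/completions"   # llama-server started as in the commands above
QUESTION = ("If my BMI is 20.5 and my height is 172cm, how much would I weigh "
            "if I gained 5% of my current weight?")

for seed in (1, 2, 3, 1, 2, 3):                      # repeat each seed to check determinism
    r = requests.post(URL, json={
        "messages": [{"role": "user", "content": QUESTION}],
        "temperature": 0.6,
        "top_p": 1.0,
        "seed": seed,
        "logprobs": True,                            # token probabilities, to spot where runs diverge
        "top_logprobs": 5,
    })
    answer = r.json()["choices"][0]["message"]["content"]
    print(seed, answer.splitlines()[-1])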

mirek190 commented 1 month ago

About groq.com - each time the answer is different but correct, so it is not temp 0 for sure. Have to check if I can get more control with groq...

At least we can see the settings on the NVIDIA web page:

from openai import OpenAI

client = OpenAI(
  base_url = "https://integrate.api.nvidia.com/v1",
  api_key = "$API_KEY_REQUIRED_IF_EXECUTING_OUTSIDE_NGC"
)

completion = client.chat.completions.create(
  model="meta/llama-3.1-8b-instruct",
  messages=[{"role":"user","content":"If my BMI is 20.5 and my height is 172cm, how much would I weigh if I gained 5% of my current weight?"}],
  temperature=0.2,
  top_p=0.7,
  max_tokens=1024,
  stream=True
)

for chunk in completion:
  if chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")
tristandruyen commented 1 month ago

About groq.com - each time the answer is different but correct, so it is not temp 0 for sure. Have to check if I can get more control with groq...

@mirek190 You can set temperature with groq if you register and use the console playground instead of groq.com chat:

https://console.groq.com/playground
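If you want to script it rather than use the playground, Groq also exposes an OpenAI-compatible API (the base URL and model id below are my assumptions - check the console for the exact values):

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",   # Groq's OpenAI-compatible endpoint
    api_key="$GROQ_API_KEY",
)

completion = client.chat.completions.create(
    model="llama-3.1-8b-instant",                # model id as listed in the Groq console
    messages=[{"role": "user",
               "content": "If my BMI is 20.5 and my height is 172cm, how much would I weigh "
                          "if I gained 5% of my current weight?"}],
    temperature=0.0,                             # pin temperature to compare with local runs
    top_p=1.0,
)
print(completion.choices[0].message.content)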

gformcreation commented 1 month ago

Hi guys, can anyone help me out? While trying to load the latest GGUF with the fix mentioned by @tristandruyen, I am getting the following error.

ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: no
llm_load_tensors: ggml ctx size = 0.14 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'C:\Users\user123\Downloads\meta-llama-3.1-8b-instruct-imat-Q4_K_M.gguf'
ERR [ load_model] unable to load model | tid="9324" timestamp=1722020543 model="C:\Users\user123\Downloads\meta-llama-3.1-8b-instruct-imat-Q4_K_M.gguf"

Here I have compiled the latest version of llama-server at llama.cpp commit id 01245f5b1629075543bc4478418c7d72a0b4b3c7.

bartowski1182 commented 1 month ago

@gformcreation

I believe this is expected; the changes break compatibility both forward and backward.

tristandruyen commented 1 month ago

Hi guys, can anyone help me out? While trying to load the latest GGUF with the fix mentioned by @tristandruyen, I am getting the following error. [...] Here I have compiled the latest version of llama-server at llama.cpp commit id 01245f5.

As @bartowski1182 already said, this is expected: commit 01245f5 is the latest master and does not include the RoPE scaling fixes from #8676. Follow the steps from here to add the fixes to your local llama.cpp.

qnixsynapse commented 1 month ago

Interesting change: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/53/files

fedric95 commented 1 month ago

I made some experiments for the 8B quantized base model:

Quantization starting from FP16

git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B
python ./convert_hf_to_gguf.py Meta-Llama-3.1-8B --outtype f16 --outfile Meta-Llama-3.1-8B.FP16.gguf
python ./convert_hf_to_gguf.py Meta-Llama-3.1-8B --outtype q8_0 --outfile Meta-Llama-3.1-8B-Q8_0.gguf
./llama-quantize ../Meta-Llama-3.1-8B.FP16.gguf ../Meta-Llama-3.1-8B-Q6_K.gguf Q6_K
./llama-quantize ../Meta-Llama-3.1-8B.FP16.gguf ../Meta-Llama-3.1-8B-Q5_K_S.gguf Q5_K_S
./llama-quantize ../Meta-Llama-3.1-8B.FP16.gguf ../Meta-Llama-3.1-8B-Q5_K_M.gguf Q5_K_M
./llama-quantize ../Meta-Llama-3.1-8B.FP16.gguf ../Meta-Llama-3.1-8B-Q4_K_M.gguf Q4_K_M
./llama-quantize ../Meta-Llama-3.1-8B.FP16.gguf ../Meta-Llama-3.1-8B-Q4_K_S.gguf Q4_K_S
./llama-quantize ../Meta-Llama-3.1-8B.FP16.gguf ../Meta-Llama-3.1-8B-Q3_K_L.gguf Q3_K_L
./llama-quantize ../Meta-Llama-3.1-8B.FP16.gguf ../Meta-Llama-3.1-8B-Q3_K_M.gguf Q3_K_M
./llama-quantize ../Meta-Llama-3.1-8B.FP16.gguf ../Meta-Llama-3.1-8B-Q3_K_S.gguf Q3_K_S
./llama-quantize ../Meta-Llama-3.1-8B.FP16.gguf ../Meta-Llama-3.1-8B-Q2_K.gguf Q2_K

Perplexity

./llama-perplexity -m Meta-Llama-3.1-8B.FP16.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q8_0.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q6_K.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q5_K_M.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q5_K_S.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q4_K_S.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q3_K_L.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q3_K_M.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q3_K_S.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q2_K.gguf -f wikitext-2-raw/wiki.test.raw

Model Perplexity
FP16 6.4016 +/- 0.03939
Q8_0 6.4070 +/- 0.03941
Q6_K 6.4231 +/- 0.03957
Q5_K_M 6.4623 +/- 0.03986
Q5_K_S 6.5173 +/- 0.04029
Q4_K_M 6.5829 +/- 0.04067
Q4_K_S 6.6742 +/- 0.04124
Q3_K_L 6.9461 +/- 0.04328
Q3_K_M 7.0468 +/- 0.04381
Q3_K_S 7.8823 +/- 0.04920
Q2_K 9.7242 +/- 0.06390

Quantization starting from BF16 (UPDATE)

git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B
python ./llama.cpp/convert_hf_to_gguf.py Meta-Llama-3.1-8B --outtype bf16 --outfile Meta-Llama-3.1-8B.BF16.gguf
./llama-quantize Meta-Llama-3.1-8B.BF16.gguf Meta-Llama-3.1-8B-Q6_K.gguf Q6_K
./llama-quantize Meta-Llama-3.1-8B.BF16.gguf Meta-Llama-3.1-8B-Q5_K_S.gguf Q5_K_S
./llama-quantize Meta-Llama-3.1-8B.BF16.gguf Meta-Llama-3.1-8B-Q5_K_M.gguf Q5_K_M
./llama-quantize Meta-Llama-3.1-8B.BF16.gguf Meta-Llama-3.1-8B-Q4_K_M.gguf Q4_K_M
./llama-quantize Meta-Llama-3.1-8B.BF16.gguf Meta-Llama-3.1-8B-Q4_K_S.gguf Q4_K_S
./llama-quantize Meta-Llama-3.1-8B.BF16.gguf Meta-Llama-3.1-8B-Q3_K_L.gguf Q3_K_L
./llama-quantize Meta-Llama-3.1-8B.BF16.gguf Meta-Llama-3.1-8B-Q3_K_M.gguf Q3_K_M
./llama-quantize Meta-Llama-3.1-8B.BF16.gguf Meta-Llama-3.1-8B-Q3_K_S.gguf Q3_K_S
./llama-quantize Meta-Llama-3.1-8B.BF16.gguf Meta-Llama-3.1-8B-Q2_K.gguf Q2_K

Perplexity

./llama-perplexity -m Meta-Llama-3.1-8B-Q6_K.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q5_K_M.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q5_K_S.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q4_K_S.gguf -f wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q3_K_L.gguf -f ../wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q3_K_M.gguf -f ../wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q3_K_S.gguf -f ../wikitext-2-raw/wiki.test.raw
./llama-perplexity -m Meta-Llama-3.1-8B-Q2_K.gguf -f ../wikitext-2-raw/wiki.test.raw

Model Perplexity
BF16 6.4006 +/- 0.03938
Q6_K 6.4231 +/- 0.03957
Q5_K_M 6.4623 +/- 0.03987
Q5_K_S 6.5161 +/- 0.04028
Q4_K_M 6.5837 +/- 0.04068
Q4_K_S 6.6751 +/- 0.04125
Q3_K_L 6.9458 +/- 0.04329
Q3_K_M 7.0488 +/- 0.04384
Q3_K_S 7.8823 +/- 0.04920
Q2_K 9.7262 +/- 0.06393
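
For anyone who wants to reproduce the sweep without typing every command, a small driver script like this works (a sketch only; it assumes the llama.cpp binaries and the wikitext file are in the paths used above):

import subprocess

QUANTS = ["Q6_K", "Q5_K_M", "Q5_K_S", "Q4_K_M", "Q4_K_S",
          "Q3_K_L", "Q3_K_M", "Q3_K_S", "Q2_K"]
BASE = "Meta-Llama-3.1-8B.BF16.gguf"
TESTSET = "wikitext-2-raw/wiki.test.raw"

for q in QUANTS:
    out = f"Meta-Llama-3.1-8B-{q}.gguf"
    # quantize from the BF16 conversion, then measure perplexity on wikitext-2
    subprocess.run(["./llama-quantize", BASE, out, q], check=True)
    subprocess.run(["./llama-perplexity", "-m", out, "-f", TESTSET], check=True)
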
mirek190 commented 1 month ago

I made some experiments for the 8B quantized base model: [...]

Can you also add IQ1xx, IQ2xx, IQ3xx and IQ4xx?

fedric95 commented 1 month ago

Can you also add IQ1xx, IQ2xx, IQ3xx and IQ4xx?

I will try in the next few days. Right now I am processing some additional Q3, Q4 and Q5 variants.

RodriMora commented 1 month ago

./convert_hf_to_gguf.py Meta-Llama-3.1-8B --outtype f16 --outfile Meta-Llama-3.1-8B.FP16.gguf

I think you would need to convert to bf16 or fp32 instead of fp16 to get better precision.

fedric95 commented 1 month ago

./convert_hf_to_gguf.py Meta-Llama-3.1-8B --outtype f16 --outfile Meta-Llama-3.1-8B.FP16.gguf

I think you would need to convert to bf16 or fp32 instead of fp16 to get better precision.

You are right. The difference should be very small - a comparison between fp16 and bf16 was done for Llama 3 and it was negligible.

I think I will repeat the experiments. At least it will be interesting to compare the results of quantization starting from fp16 with quantization starting from fp32/bf16.
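The difference between the two is mostly dynamic range rather than mantissa precision; a quick way to see it (assuming torch is installed):

import torch

x = torch.tensor([1e-5, 0.1, 1.0, 3.0e4, 7.0e4], dtype=torch.float32)
print(x.to(torch.float16))    # 7e4 overflows to inf: fp16 tops out at 65504
print(x.to(torch.bfloat16))   # bf16 keeps fp32's exponent range but has fewer mantissa bits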

bopm commented 1 month ago

Always getting a proper answer - 36

Locally with llama 3.1 8b (q8) - hardly getting a proper answer every 5 attempts.

Am I the only one who still sees this even now?

bartowski1182 commented 1 month ago

@bopm are you seeing it work locally or not work locally?

If it's not working, can you provide your exact commands?

bopm commented 1 month ago

@bartowski1182 never mind, it seems to be an Ollama issue in fact.

bopm commented 1 month ago

I reported it to the Ollama repo, as it did a pretty decent job for me on a single run, but now it's not looking good for llama.cpp either. Details are in the issue here. Here is a good run:

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

> I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?
You started with 10 apples. You lost 4, so you had 10 - 4 = 6 apples.

You then gained a gold coin. This coin will buy you 1 / 0.5 = 2 apples.

You also gained 6 gold coins from the online game, and had to share them equally with your 2 teammates. So you got 6 / 3 = 2 gold coins.

With 2 gold coins, you could buy 2 / 0.5 = 4 more apples.

So in total, you had 6 + 2 + 4 = 12 apples.

The river was near the city of Paris, where the Eiffel Tower is located.

> I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?
Let's break it down step by step:

1. You start with 10 apples.
2. You find 3 gold coins in the river, so you have 3 gold coins.
3. The river runs near a city that is known for its gold (e.g. Gold Coast, California; Gold Beach, California; Golden, Colorado; etc.). Let's say it's the Gold Coast, California.
4. You lose 4 apples, so you have 10 - 4 = 6 apples.
5. You gain 1 gold coin, so you have 3 + 1 = 4 gold coins.
6. Three birds drop 6 apples each, so you gain 18 apples, making a total of 6 + 18 = 24 apples.
7. You play an online game and win 6 gold coins, but you have to share them equally with your 2 teammates, so you get 6 / 3 = 2 gold coins.
8. You already had 4 gold coins, so now you have 4 + 2 = 6 gold coins.
9. You buy apples for all the coins you have. Since the price of an apple is 0.5 coins, you can buy 6 / 0.5 = 12 apples with your 6 gold coins.
10. However, you already had 24 apples, so now you have 24 + 12 = 36 apples.

So, you have 36 apples, and the river is near the Gold Coast, California.

> I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?
Let's break it down step by step:

1. You start with 10 apples.
2. You find 3 gold coins in the river. The river runs near a city that has something to do with coins (e.g. a city with a casino, a city known for its mint, etc.). Let's say it's Las Vegas, where you can spend the coins on games and slots.
3. You lose 4 apples, so you have 10 - 4 = 6 apples.
4. You gain 1 gold coin, so you have 3 + 1 = 4 gold coins.
5. Three birds drop 6 apples each, so you gain 18 apples, making a total of 6 + 18 = 24 apples.
6. You play an online game and win 6 gold coins, but you have to share them equally with your 2 teammates, so you get 6 / 3 = 2 gold coins. You already had 4 gold coins, so now you have 4 + 2 = 6 gold coins.
7. You buy apples for all the coins you have. Since the price of an apple is 0.5 coins, you can buy 6 / 0.5 = 12 apples with your 6 gold coins. You already have 24 apples, so now you have 24 + 12 = 36 apples.

So, you have 36 apples, and the river is near Las Vegas.

> I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?
Let's break it down step by step:

1. You start with 10 apples.
2. You find 3 gold coins in the river. The river runs near a big city that has something to do with gold (e.g. a city with a lot of gold, a city famous for its gold coins, etc.). Let's say it's a city that has a lot of gold in it, so you can spend the coins in a gold shop.
3. You lose 4 apples, so you have 10 - 4 = 6 apples.
4. You gain 1 gold coin, so you have 3 + 1 = 4 gold coins.
5. Three birds drop 6 apples each, so you gain 18 apples, making a total of 6 + 18 = 24 apples.
6. You play an online game and win 6 gold coins, but you have to share them equally with your 2 teammates, so you get 6 / 3 = 2 gold coins. You already had 4 gold coins, so now you have 4 + 2 = 6 gold coins.
7. You buy apples for all the coins you have. Since the price of an apple is 0.5 coins, you can buy 4 / 0.5 = 8 apples with your 4 gold coins, and 2 / 0.5 = 4 apples with your 2 gold coins. You already have 24 apples, so now you have 24 + 8 + 4 = 36 apples.

So, you have 36 apples.

As for where the river is, it could be near a city that has a lot of gold in it. Some possibilities could be:

* A city with a lot of gold mines (e.g. Johannesburg, South Africa)
* A city with a famous gold market or gold exchange (e.g. Shanghai, China)
* A city that is known for its gold coins or gold jewelry (e.g. Dubai, UAE)

But without more information, it's impossible to say for sure where the river is.

> I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?
Let's break it down step by step:

1. You start with 10 apples.
2. You find 3 gold coins in the river. The river runs near a city that has something to do with what you can spend the coins on. Since you find coins, it's likely a city with casinos or online gaming (e.g. Las Vegas, Macau).
3. You lose 4 apples, so you have 10 - 4 = 6 apples.
4. You gain 1 gold coin, so you have 3 + 1 = 4 gold coins.
5. Three birds drop 6 apples each, so you gain 18 apples, making a total of 6 + 18 = 24 apples.
6. You play an online game and win 6 gold coins, but you have to share them equally with your 2 teammates, so you get 6 / 3 = 2 gold coins. You already had 4 gold coins, so now you have 4 + 2 = 6 gold coins.
7. You buy apples for all the coins you have. Since the price of an apple is 0.5 coins, you can buy 6 / 0.5 = 12 apples with your 6 gold coins. You already have 24 apples, so now you have 24 + 12 = 36 apples.

So, you have 36 apples.

As for where the river is, it's likely near a city with casinos or online gaming, such as:

* Las Vegas, USA
* Macau, China
* Monte Carlo, Monaco
* Singapore
mirek190 commented 1 month ago

I reported it to the Ollama repo, as it did a pretty decent job for me on a single run, but now it's not looking good for llama.cpp either. [...]

Does your command look similar to this?

llama-cli.exe --model models/new3/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --ctx-size 10000 --interactive -ngl 99 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --chat-template llama3
bopm commented 1 month ago

With these exact params

llama-cli -m /Users/sergeymoiseev/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87 --color --threads 30 --keep -1 --n-predict -1 --ctx-size 10000 --interactive -ngl 99 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.6 --chat-template llama3 -cnv

it's still hallucinating results like 17, 30, 29, and so on.

mirek190 commented 1 month ago

Try with temp 0. Still hallucinations?

bartowski1182 commented 1 month ago

Maybe try an imatrix quant? My imatrix q4_k_m gets this right every time even without a low temperature
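If anyone wants to roll their own, the imatrix route is just two extra steps (sketched here via subprocess with assumed file names; the choice of calibration text is up to you):

import subprocess

# 1) compute an importance matrix from a calibration text file
subprocess.run(["./llama-imatrix", "-m", "Meta-Llama-3.1-8B-Instruct.BF16.gguf",
                "-f", "calibration.txt", "-o", "imatrix.dat", "-ngl", "99"], check=True)

# 2) quantize using that matrix
subprocess.run(["./llama-quantize", "--imatrix", "imatrix.dat",
                "Meta-Llama-3.1-8B-Instruct.BF16.gguf",
                "Meta-Llama-3.1-8B-Instruct-Q4_K_M-imat.gguf", "Q4_K_M"], check=True)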

fedric95 commented 1 month ago

./convert_hf_to_gguf.py Meta-Llama-3.1-8B --outtype f16 --outfile Meta-Llama-3.1-8B.FP16.gguf

I think you would need to convert to bf16 or fp32 to have better precision, instead of fp16

Previous comment updated with the BF16 experiments.

bopm commented 1 month ago

Try with temp 0. Still hallucinations?

You started with 10 apples. You lost 4 apples, so you have 10 - 4 = 6 apples. Then, 3 birds dropped 6 apples each, so you gained 3 * 6 = 18 apples. Now you have 6 + 18 = 24 apples.

You found 3 gold coins in the river, lost 1, and gained 6. You have 3 - 1 + 6 = 8 gold coins. You can spend the coins in the city of Paris, which is famous for its gold coins, the French currency, the "franc".
bopm commented 1 month ago

Maybe try an imatrix quant?

Yep, way better - only mistaken on the first run, giving me 32, then a stable 36 on all subsequent retries.

mirek190 commented 1 month ago

Maybe try an imatrix quant?

Yep, way better - only mistaken on the first run, giving me 32, then a stable 36 on all subsequent retries.

with -temp 0?

bopm commented 1 month ago

with -temp 0?

With -temp 0.6

gavin-edward commented 1 month ago

I made some experiments for the 8B quantized base model: [...]

Hello, I have a little question: where is llama-quantize? Do I need to build it myself?

fedric95 commented 1 month ago

Hello, I have a little question: where is llama-quantize? Do I need to build it myself?

You can call it from the llama.cpp directory. As of https://github.com/ggerganov/llama.cpp/pull/7809, llama-quantize is just the renamed version of quantize. I have created a repo on HF with all the quantized models: https://huggingface.co/fedric95/Meta-Llama-3.1-8B-GGUF

fairydreaming commented 1 month ago

FYI after merging #8858 it's now possible to handle <|python_tag|> tool calls since generation now stops after <|eom_id|>. I've been playing with Python tool calls in Llama 3.1 running on llama.cpp server for the past few days and initial results are very encouraging. Even the smallest Llama 3.1 8B has no problems with this. Here's an example conversation I had with this model: https://pastebin.com/N0rz3yZj
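For anyone wanting to try it, the client-side loop is roughly this (a sketch against llama-server's /completion endpoint; the exact prompt formatting and whether special tokens show up in the returned text depend on your server settings, so treat the details as assumptions):

import requests, subprocess

SERVER = "http://localhost:8080/completion"

def generate(prompt):
    r = requests.post(SERVER, json={"prompt": prompt, "temperature": 0.0,
                                    "stop": ["<|eom_id|>", "<|eot_id|>"]})
    return r.json()["content"]

prompt = ("<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
          "Environment: ipython<|eot_id|>"
          "<|start_header_id|>user<|end_header_id|>\n\nWhat is 2**32?<|eot_id|>"
          "<|start_header_id|>assistant<|end_header_id|>\n\n")
out = generate(prompt)

if out.lstrip().startswith("<|python_tag|>"):
    code = out.split("<|python_tag|>", 1)[1]
    result = subprocess.run(["python", "-c", code], capture_output=True, text=True).stdout
    # hand the tool output back in an ipython turn and let the model finish the answer
    prompt += (out + "<|eom_id|><|start_header_id|>ipython<|end_header_id|>\n\n"
               + result + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n")
    print(generate(prompt))
else:
    print(out)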

steampunque commented 1 month ago

FYI after merging #8858 it's now possible to handle <|python_tag|> tool calls since generation now stops after <|eom_id|>. I've been playing with Python tool calls in Llama 3.1 running on llama.cpp server for the past few days and initial results are very encouraging. Even the smallest Llama 3.1 8B has no problems with this. Here's an example conversation I had with this model: https://pastebin.com/N0rz3yZj

It's also possible to use custom tool calls (https://huggingface.co/blog/llama31#built-in-tool-calling), avoiding the need for the ipython shell and eom handling. Ask the model to make a Python code block based on your query, extract it, run it, and send its output back into the conversation as described in the link. Works fine:

bash-5.1$ ./lmf Is it hotter in NYC, Austin, or Houston right now.
CAT get_current_conditions
nohup: redirecting stderr to stdout
TOOL : {'coord': {'lon': -74.006, 'lat': 40.7143}, 'weather': [{'id': 803, 'main': 'Clouds', 'description': 'broken clouds', 'icon': '04d'}], 'base': 'stations', 'main': {'temp': 86.14, 'feels_like': 93.7, 'temp_min': 80.56, 'temp_max': 91.99, 'pressure': 1011, 'humidity': 66, 'sea_level': 1011, 'grnd_level': 1010}, 'visibility': 10000, 'wind': {'speed': 17, 'deg': 167, 'gust': 20}, 'clouds': {'all': 75}, 'dt': 1722973854, 'sys': {'type': 1, 'id': 4610, 'country': 'US', 'sunrise': 1722938278, 'sunset': 1722989146}, 'timezone': -14400, 'id': 5128581, 'name': 'New York', 'cod': 200}
{'coord': {'lon': -97.7431, 'lat': 30.2672}, 'weather': [{'id': 800, 'main': 'Clear', 'description': 'clear sky', 'icon': '01d'}], 'base': 'stations', 'main': {'temp': 98.26, 'feels_like': 107.65, 'temp_min': 96.6, 'temp_max': 101.12, 'pressure': 1014, 'humidity': 43, 'sea_level': 1014, 'grnd_level': 990}, 'visibility': 10000, 'wind': {'speed': 8.05, 'deg': 100, 'gust': 16.11}, 'clouds': {'all': 0}, 'dt': 1722973538, 'sys': {'type': 2, 'id': 2008738, 'country': 'US', 'sunrise': 1722945174, 'sunset': 1722993643}, 'timezone': -18000, 'id': 4671654, 'name': 'Austin', 'cod': 200}
{'coord': {'lon': -95.3633, 'lat': 29.7633}, 'weather': [{'id': 802, 'main': 'Clouds', 'description': 'scattered clouds', 'icon': '03d'}], 'base': 'stations', 'main': {'temp': 96.57, 'feels_like': 109.17, 'temp_min': 93.96, 'temp_max': 99.14, 'pressure': 1014, 'humidity': 52, 'sea_level': 1014, 'grnd_level': 1011}, 'visibility': 10000, 'wind': {'speed': 1.01, 'deg': 224, 'gust': 5.01}, 'clouds': {'all': 40}, 'dt': 1722973718, 'sys': {'type': 2, 'id': 2001415, 'country': 'US', 'sunrise': 1722944653, 'sunset': 1722993022}, 'timezone': -18000, 'id': 4699066, 'name': 'Houston', 'cod': 200}
TOOL DONE
Based on the API responses, the current temperature in NYC is 86.14°F, in Austin is 98.26°F, and in Houston is 96.57°F. Therefore, it is hotter in Austin right now.
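
The extraction side of that is tiny - something along these lines (a sketch; the helper names are made up):

import re, subprocess

def extract_python_block(reply: str):
    # pull the first ```python ... ``` block out of a model reply, if any
    m = re.search(r"```python\n(.*?)```", reply, re.DOTALL)
    return m.group(1) if m else None

def run_block(code: str) -> str:
    # run the snippet in a subprocess and capture whatever it prints
    proc = subprocess.run(["python", "-c", code],
                          capture_output=True, text=True, timeout=60)
    return proc.stdout + proc.stderr

# reply = assistant message containing a ```python ...``` tool call
# tool_output = run_block(extract_python_block(reply))
# ...append tool_output as the next turn and ask the model to answer from it.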

AbdullahHameedKhan commented 1 month ago

...yes, currently llama 3.1 8b seems a bit dumber than llama 3 8b... I do not know if it is a GGUF problem or llama.cpp itself. For instance, the question "I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?" with https://groq.com/ always gets the proper answer - 36. Locally with llama 3.1 8b (q8) - hardly getting a proper answer every 5 attempts.

Same observation here. Not sure if it's an issue with the model or with llama.cpp (tested a Q6_K quant with b3438), but for now 3.1 feels way worse than 3.0:

(screenshot)

(screenshot)

Temperature 0 fails with both of those. Tested with an empty system prompt and with "You're a helpful assistant." - neither works well. Tried with -c 8192 and -c 16384 - similar results.

Update the generation_config file. This is a template issue, not a model issue with Llama 3.1.