ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: Proper Llama 3.1 Support in llama.cpp #8650

Closed Vaibhavs10 closed 1 month ago

Vaibhavs10 commented 1 month ago

Feature Description

Llama 3.1 was just released and it is a significant leg up from the previous series of models: https://huggingface.co/blog/llama31

Whilst the overall architecture is the same, it requires some modelling updates, primarily around RoPE scaling: https://github.com/huggingface/transformers/blob/bc2adb0112b6677b0dfb4105c74570a0f92183eb/src/transformers/modeling_rope_utils.py#L298

It'd be great to add support for those so that the generations are more coherent and make sense.

Motivation

Note: Without the modelling changes, generations might look coherent, but they fall far short of the model's true potential!

Possible Implementation

Here's the corresponding transformers implementation: https://github.com/huggingface/transformers/blob/bc2adb0112b6677b0dfb4105c74570a0f92183eb/src/transformers/modeling_rope_utils.py#L298
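
For reference, the core of the linked transformers logic rescales the per-dimension RoPE inverse frequencies roughly as follows. This is a paraphrased Python sketch (not llama.cpp code), using the parameter names from the rope_scaling block of the released config.json:

import math

def llama3_scale_rope_inv_freq(inv_freq, factor=8.0, low_freq_factor=1.0,
                               high_freq_factor=4.0, original_max_position_embeddings=8192):
    # Wavelength thresholds (in tokens) derived from the original 8k context
    low_freq_wavelen = original_max_position_embeddings / low_freq_factor
    high_freq_wavelen = original_max_position_embeddings / high_freq_factor
    scaled = []
    for freq in inv_freq:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            scaled.append(freq)            # high-frequency dims: left unchanged
        elif wavelen > low_freq_wavelen:
            scaled.append(freq / factor)   # low-frequency dims: frequency divided by `factor`
        else:
            # dims in between: smooth interpolation between the two regimes
            smooth = (original_max_position_embeddings / wavelen - low_freq_factor) / \
                     (high_freq_factor - low_freq_factor)
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return scaled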

qnixsynapse commented 1 month ago

Also, adding to this: proper function-calling support in the server, since Llama 3.1 now supports tooling/function calling.

Dampfinchen commented 1 month ago

It looks like they've added a new EOS token called <|eom_id|>, alongside the already existing <|end_of_text|> and <|eot_id|> ones, something to look out for.

mirek190 commented 1 month ago

So, what is the proper template now?

tristandruyen commented 1 month ago

So, what is the proper template now?

The new template doesn't seem to use the new EOS token, so the existing templates should work fine AFAIK. It might only be used for tool calls or something like that, not sure yet...

ngxson commented 1 month ago

Also, adding to this: proper function-calling support in the server, since Llama 3.1 now supports tooling/function calling.

IMO support for function calling can be done more easily (and more robustly) in Python, for example via llama-cpp-python.

I tried implementing the same thing for functionary model before, but the code is very hard to maintain.

Edit: yeah, so people seem to misunderstand my point. What I'm trying to say is: in reality, most models are trained to call tools in Python, so the tooling must be in Python from the beginning.

m18coppola commented 1 month ago

Converting llama-3.1 seems to make it set the tokenizer.ggml.pre = 'smaug-bpe' instead of llama-bpe.

mirek190 commented 1 month ago

...yes, currently llama 3.1 8b seems a bit dumber than llama 3 8b ... I don't know if it's a problem with the GGUF or with llama.cpp itself.

For instance

question "I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?"

with https://groq.com/

Always getting a proper answer - 36

Locally with llama 3.1 8b (q8) I barely get the proper answer once every 5 attempts.

RodriMora commented 1 month ago

...yes, currently llama 3.1 8b seems a bit dumber than llama 3 8b ... I don't know if it's a problem with the GGUF or with llama.cpp itself.

For instance

question "I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?"

with https://groq.com/

Always getting a proper answer - 36

Locally with llama 3.1 8b (q8) I barely get the proper answer once every 5 attempts.

Do you know what parameters groq is using? Maybe they have a lower temperature?

Edit: just tested with Q8_0 at temp 0.0 and it gave me the correct result each time, but it usually fails at higher temps.

dranger003 commented 1 month ago

There seems to be a change in the way RoPE is used, see: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/commit/13f04ed6f85ef2aa2fd11b960a275c3e31a8069e

Also, for long context the model isn't working unless I use 8000000 as the RoPE base frequency for 48K context (just an example).

+   "rope_scaling": {
+     "factor": 8.0,
+     "low_freq_factor": 1.0,
+     "high_freq_factor": 4.0,
+     "original_max_position_embeddings": 8192,
+     "rope_type": "llama3"
+   },
    "rope_theta": 500000.0,
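
For anyone experimenting in the meantime: the base frequency can be overridden at load time without reconverting, e.g. (model path illustrative):

./llama-server -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 49152 --rope-freq-base 8000000

This is only a stopgap though - it is not the same as the llama3-style rope_scaling above.
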
MoonRide303 commented 1 month ago

...yes, currently llama 3.1 8b seems a bit dumber than llama 3 8b ... I don't know if it's a problem with the GGUF or with llama.cpp itself.

For instance

question "I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?"

with https://groq.com/

Always getting a proper answer - 36

Locally with llama 3.1 8b (q8) I barely get the proper answer once every 5 attempts.

Same observation here. Not sure if it's an issue with the model or with llama.cpp (tested Q6_K quant with b3438), but for now 3.1 feels way worse than 3.0:

[two screenshots of failed runs]

Temperature 0 fails with both of those. Tested with an empty system prompt and with "You're a helpful assistant." - neither works well. Tried with -c 8192 and -c 16384 - similar results.

fairydreaming commented 1 month ago

I did some local tests of Q8_0 8B model in llama.cpp with 4096 context size and with low temperature set (0.01) it often enters generation loops repeating the same sentences over and over. I noticed the same problem with this model when using OpenRouter API. Attached is an example prompt causing problems: prompt-llama-3.1.txt

Command line: ./llama-cli --numa distribute -t 32 -s 42 -c 4096 --temp 0.01 -m models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -f prompt-llama-3.1.txt

It happens also when using CUDA backend: ./llama-cli -t 1 -ngl 33 -s 42 -c 4096 --temp 0.01 -m models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -f prompt-llama-3.1.txt

Did anyone experience similar problems?

mirek190 commented 1 month ago

...yes, currently llama 3.1 8b seems a bit dumber than llama 3 8b ... I don't know if it's a problem with the GGUF or with llama.cpp itself. For instance, question "I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?" with https://groq.com/ Always getting a proper answer - 36. Locally with llama 3.1 8b (q8) I barely get the proper answer once every 5 attempts.

Do you know what parameters groq is using? Maybe they have a lower temperature?

Edit: just tested with Q8_0 at temp 0.0 and it gave me the correct result each time, but it usually fails at higher temps.


Even at temp 0 I'm always getting 34.

[screenshot]

mirek190 commented 1 month ago

...yes, currently llama 3.1 8b seems a bit dumber than llama 3 8b ... I don't know if it's a problem with the GGUF or with llama.cpp itself. For instance, question "I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?" with https://groq.com/ Always getting a proper answer - 36. Locally with llama 3.1 8b (q8) I barely get the proper answer once every 5 attempts.

Same observation here. Not sure if it's an issue with the model or with llama.cpp (tested Q6_K quant with b3438), but for now 3.1 feels way worse than 3.0:

[two screenshots of failed runs]

Temperature 0 fails with both of those. Tested with an empty system prompt and with "You're a helpful assistant." - neither works well. Tried with -c 8192 and -c 16384 - similar results.

Yes ... llama 3.1 8b seems even dumber than llama 3 8b - something is off ... the GGUF, llama.cpp, or both ;)
Tested under groq - there it is much smarter than llama 3 8b.

EliEron commented 1 month ago

It looks like they've added a new EOS token called <|eom_id|>, alongside the already existing <|end_of_text|> and <|eot_id|> ones, something to look out for.

The <|eom_id|> token is used specifically during tool calls. It marks the point where the model is done setting up the call and expects the backend to run the tool and provide the results. So it's a bit different from a traditional EOS token in the sense that it does not mark the response as done, the model still has more to generate, but it needs to get a result from the tool call before it can resume its response.
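
Roughly, the turn structure during a tool call looks like this (paraphrasing the Llama 3.1 prompt-format docs; the exact formatting should be double-checked against the model card):

<|start_header_id|>assistant<|end_header_id|>

...tool call emitted by the model...<|eom_id|>
<|start_header_id|>ipython<|end_header_id|>

...tool output inserted by the backend...<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

...the model resumes and finishes its reply...<|eot_id|>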

vlbosch commented 1 month ago

I have just converted the model from HF to GGUF and then quantized to Q8 with the following extra options: --leave-output-tensor --token-embedding-type f16. The model seems to be responding quite well, especially since I prompt exclusively in Dutch.
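
For anyone who wants to reproduce this, the commands look roughly like the following (paths and filenames are illustrative; the flags are the ones mentioned above):

python convert_hf_to_gguf.py ~/models/Meta-Llama-3.1-8B-Instruct --outtype f16 --outfile Meta-Llama-3.1-8B-Instruct-f16.gguf
./llama-quantize --leave-output-tensor --token-embedding-type f16 Meta-Llama-3.1-8B-Instruct-f16.gguf Meta-Llama-3.1-8B-Instruct-Q8_0.gguf Q8_0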

m18coppola commented 1 month ago

Converting llama-3.1 seems to make it set the tokenizer.ggml.pre = 'smaug-bpe' instead of llama-bpe.

Investigation led me to figure out why the smaug-bpe pre-tokenizer was being used instead of llama-bpe: it seems to be a problem with the transformers library not prefixing a BOS token.

Example:

from transformers import AutoTokenizer

llama_3_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
llama_3_1_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

text = "Hello"

print(llama_3_tokenizer.encode(text))
print(llama_3_1_tokenizer.encode(text))

Output:

[128000, 9906]
[9906]

It seems like the official code prefixes the BOS token.
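
This also explains how the detection flips: convert_hf_to_gguf.py picks the pre-tokenizer by hashing the tokenization of a fixed probe string and comparing it against a table of known hashes, so a change in BOS handling changes which entry matches. Roughly (simplified sketch, not a verbatim copy of the script):

import hashlib
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
chktxt = "..."  # the long probe string used by the real script (elided here)
chkhsh = hashlib.sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
# chkhsh is then compared against the known hashes ("llama-bpe", "smaug-bpe", ...)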

oldgithubman commented 1 month ago

It looks like they've added a new EOS token called <|eom_id|>, alongside the already existing <|end_of_text|> and <|eot_id|> ones, something to look out for.

ffs

...

Edit - in exchange for function calling it's worth it, I suppose

oldgithubman commented 1 month ago

the existing templates should work fine AFAIK

Dangerous assumption

steampunque commented 1 month ago

I did a Q6_K quant. I first added the model to convert_hf_to_gguf_update and ran it, but still got the smaug pre-tokenizer, so I just replaced smaug with llama in the convert script:

            #res = "smaug-bpe"
            res = "llama-bpe"

Seems to be doing fine. I don't use the llama.cpp tokenizer for BOS or the chat template; I apply BOS + template myself in a modified server, using the exact same template as 3.0. Tests:

lm prompt-llama-3.1.txt 
Let's break down the relationships:

* Margaret is Carl's parent.
* Margaret is Lawrence's parent.
* Carl is Denise's parent.
* Catherine is Julie's parent.
* Julie is Christopher's parent.
* Catherine is Lawrence's parent.

So, Catherine is the parent of both Lawrence and Julie. This makes Catherine the grandparent of Lawrence and Julie's child, which is Christopher.

Therefore, Christopher is Lawrence's grandchild.

However, the options don't include "grandchild" directly. The closest option is:

<ANSWER>3</ANSWER>

This option says "Christopher is Lawrence's great grandchild", which is not entirely accurate. Christopher is Lawrence's grandchild, not great grandchild. However, since the options don't include "grandchild" directly, and "great grandchild" is the closest option, I'll select it as the correct answer.
bash-5.1$ 

Gemma 27b gave this response to the prompt:

bash-5.1$ lm prompt-llama-3.1.txt 
 Based on the given family relationships, Christopher is Julie's parent, and Julie is Catherine's parent. Since Catherine is Lawrence's parent, Christopher is Lawrence's grandchild. Therefore, the correct answer is:

3. Christopher is Lawrence's grandchild.

So the answer is <ANSWER>3</ANSWER>.

Math also looks OK:

bash-5.1$ lm what is 3333+777?
Here's the calculation:

3333 + 777 = 4110
bash-5.1$ lmc +100 to that?
Here's the updated calculation:

4110 + 100 = 4210
bash-5.1$ lmc -5211?
Here's the calculation:

4210 - 5211 = -1001

The goldcoin thing also works. goldcoin.txt

bash-5.1$ lm goldcoin.txt   
What a delightful and imaginative problem! Let's break it down step by step.

**Step 1: Initial situation**

You have 10 apples.

**Step 2: Find gold coins**

You find 3 gold coins in the river.

**Step 3: Lose apples**

You lose 4 apples, so you now have:
10 - 4 = 6 apples

**Step 4: Gain a gold coin**

You gain a gold coin, so you now have:
3 + 1 = 4 gold coins

**Step 5: Birds drop apples**

Three birds drop 6 apples each, so you receive:
3 x 6 = 18 apples
You now have:
6 (from Step 3) + 18 = 24 apples

**Step 6: Play online game**

You win 6 gold coins, but you have to share them equally with your 2 teammates, so you receive:
6 / 3 = 2 gold coins
You now have:
4 (from Step 4) + 2 = 6 gold coins

**Step 7: Buy apples**

The price of an apple is 0.5 coins, and you have 6 gold coins, so you can buy:
6 / 0.5 = 12 apples
You now have:
24 (from Step 5) + 12 = 36 apples

**The river's location**

Unfortunately, the problem doesn't provide any information about the river's location. It only mentions that it runs near a big city, but the city's name or location is not specified.

**The final answer**

You have 36 apples.
qnixsynapse commented 1 month ago

I tried implementing the same thing for functionary model before, but the code is very hard to maintain.

Can you point me to that commit?

Edit: @ngxson Got the draft: https://github.com/ggerganov/llama.cpp/pull/5695

Anyway, I have opened a request in the llama-cpp-python repository.

steampunque commented 1 month ago

I ran some quick benches on Llama 3.1 and it does look to be giving a performance boost over 3. As far as I am aware, the long-RoPE changes should not impact these benchmarks, as my max tokens is 2500 for the test (for CoT). Based on these results I think it's running well on llama.cpp for short contexts. (I am running version 3428.)

These benches use my own custom prompts; they are not the standard evaluation harness. I zero-shot everything and require the model to follow a circularly shifted answer double-check prompt to score a success on all multiple-choice tests (TQA2 and BOOLQ are both A/B MC in my runs). This ensures the model actually knew the answer solidly and did not luck out based on random answer positioning.

Gemma 2 9b is still the smartest 8B-class model I have ever run. However, Llama 3.1 with 128k context becomes very interesting once the long-RoPE issue is sorted out. Gemma 2 9b has only 8k context and its context memory has very high overhead (the VRAM/token ratio is high).

model                   Meta-Llama-3.1-8B-Instruct   Meta-Llama-3-8B-Instruct   gemma-2-9b-it
quant                   Q6_K                         Q6_K                       Q6_K
----------------------  ---------------------------  -------------------------  -------------
WG                      0.737                        0.707                      0.762
LAMBADA                 0.705                        0.710                      0.735
HELLASWAG               0.694                        0.667                      0.775
TQA1                    0.556                        0.507                      0.701
TQA2                    0.510                        0.504                      0.692
BOOLQ                   0.612                        0.609                      0.687
ARCC                    0.776                        0.732                      0.882
ARCE                    0.905                        0.883                      0.952
RACEM                   0.725                        0.708                      0.849
RACEH                   0.678                        0.641                      0.802
CSQA                    0.683                        0.639                      0.751
OBQA                    0.765                        0.685                      0.846
COPA                    0.887                        0.886                      0.925
PIQA                    0.723                        0.681                      0.801
SIQA                    0.647                        0.624                      0.693
JEOPARDY                0.540                        0.370                      0.550
GSM8K (Zero shot CoT)   0.870                        0.817                      0.890
HUMANEVAL               0.664                        0.591                      0.658

bartowski1182 commented 1 month ago

For the record, I wonder if it's being recognized as smaug-bpe because smaug used the llama 3 tokenizer but with some changes to the post_processor that match what llama 3.1 was released with? So they actually tokenize the same way, and that's why the checksum matches it?

If you look at the tokenizer.json in llama 3, there's a TemplateProcessing step that doesn't exist in smaug and llama 3.1

That said, smaug flips the ignore_merges flag, so I'm not sure if that makes a bigger difference...

bartowski1182 commented 1 month ago

The more I look, the more I feel smaug-bpe is a non-factor.

If you look through the code, the only thing that being labelled smaug-bpe actually does is select the regex for smaug, which is an exact match of what llama 3 uses, so it's the same

It just happens to be that llama 3.1 tokenizes identically to smaug-bpe instead of llama 3, but in the end it doesn't actually matter

bartowski1182 commented 1 month ago

@steampunque can you by chance compare to https://huggingface.co/lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf to see if it's the same? I got the right answer on math and your gold coin question

Nottlespike commented 1 month ago

This may actually be an Ollama issue with the modelfile, as the config.json is different than expected; per the paper, it was changed from 3.0 to 3.1.

Noah670 commented 1 month ago

Great feature thank you

joseph777111 commented 1 month ago

The more I look, the more I feel smaug-bpe is a non-factor.

If you look through the code, the only thing that being labelled smaug-bpe actually does is select the regex for smaug, which is an exact match of what llama 3 uses, so it's the same

It just happens to be that llama 3.1 tokenizes identically to smaug-bpe instead of llama 3, but in the end it doesn't actually matter

I think you're right @bartowski1182! When I try to do what @m18coppola did, the results are not good. But when I just convert to GGUF without changing convert_hf_to_gguf.py, the model seems more intelligent. I think the RoPE settings, which @dranger003 pointed out, might be messing up the model's generations. 🤔

joseph777111 commented 1 month ago

Could part of the problem be caused by wrong generation parameters? LLaMa-3.1-8B-Instruct's generation_config.json states that:

temperature = 0.6, top_p = 0.9

Would this make a difference?

{
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "temperature": 0.6,
  "top_p": 0.9,
  "transformers_version": "4.42.3"
}
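
For local testing these map directly onto llama.cpp's sampling flags, e.g. (model path illustrative):

./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf --temp 0.6 --top-p 0.9

so it's worth checking that the defaults being used locally aren't far off from these.
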
arch-btw commented 1 month ago

There's a merge request for an update to the tokenizer:

https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/28/files

edit: another one:

https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/29

The config.json was also updated 16 hours ago (and another open merge request). I'm not sure if llama.cpp uses config.json during quantization though.

MoonRide303 commented 1 month ago

@arch-btw I tried conversion with the updated special_tokens_map.json and tokenizer.json (already merged on the HF main branch), but something is still missing - the model just doesn't shut up (tested with b3449). Same problem with temp 0 and temp 0.6.

Green-Sky commented 1 month ago

Is it still different from https://huggingface.co/teknium/Llama-3.1-AlternateTokenizer/blob/main/tokenizer_config.json ?

Edit: this might be unrelated - they may just be preparing a different tokenizer.

joseph777111 commented 1 month ago
[screenshot of the config.json change]

What does this change to the config.json do?

https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/26/files

qnixsynapse commented 1 month ago

All of these need to be updated. The convert script needs to put these keys in the GGUF file, and similarly llama.cpp needs to be updated.

It's not wise to convert right now imo.

fairydreaming commented 1 month ago

The problem I encountered with the 8B instruct model entering infinite generation loops also happens in the transformers library, so it's not llama.cpp's fault.

EliEron commented 1 month ago

I've also encountered consistent infinite loops in the Transformers version, though in my case it affects the 70B and 405B models. So it seems there's something wrong indeed.

Green-Sky commented 1 month ago

Did anyone test the original code + weights?

https://github.com/meta-llama/llama-models/blob/1b5892739868e5333fb7f022ba91218f0ae5f9c2/models/llama3_1/api/sku_list.py#L40

https://github.com/meta-llama/llama-models/blob/1b5892739868e5333fb7f022ba91218f0ae5f9c2/models/llama3_1/api/model.py#L41-L63

RodriMora commented 1 month ago

I have downloaded the original weights with the updated tokenizer, requantized to GGUF q8_0, and ran some tests using groq as a benchmark. All tests are for 8B.

vllm as the backend for the original model + openwebui as the frontend, temperature at 0.5: vllm serve --host 0.0.0.0 --port 5000 --gpu-memory-utilization 0.95 ~/models/Meta-Llama-3.1-8B-Instruct/ --served-model-name gpt-3.5-turbo -tp 4 --max-model-len 8000

[result screenshots]

llama.cpp HTTP server as the backend for the GGUF + openwebui as the frontend, temperature at 0.5: ~/llama.cpp/llama-server -m ~/models/Meta-Llama-3.1-8B-Instruct-q8_0.gguf -c 8000 -ngl 99 --host 0.0.0.0 --port 5000

[result screenshots]

groq:

[result screenshots]

Note: asking groq to "output the previous message" to get the system prompt yielded "Please try to provide useful, helpful and actionable answers.", so that's what I used as the system prompt.

So right now, with the new tokenizer and the context limited to 8K, it seems to work as expected. No repetition or any other problems.

foldl commented 1 month ago

Here is an implementation of llama3 RoPE:

https://github.com/foldl/chatllm.cpp/blob/master/src/custom_ops.cpp#L837-L851

m18coppola commented 1 month ago

@bartowski1182 @joseph777111 The difference between smaug-bpe and llama-bpe is enough to stop llama.cpp from adding the BOS, which is pretty annoying.

I reported the issue, and they fixed it already. Using convert_hf_to_gguf.py now yields llama-bpe without the need for modification.
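
A quick way to double-check a converted GGUF is to tokenize a short string and make sure token 128000 (<|begin_of_text|>) is prepended, e.g. with --verbose-prompt (model path illustrative):

./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -p "Hello" -n 1 --verbose-prompt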

qnixsynapse commented 1 month ago

Interesting that we have an ipython role now along with system, user and assistant: https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/

cmp-nct commented 1 month ago

I did a Q6_K quant. I first added the model to convert_hf_to_gguf_update and ran it, but still got the smaug pre-tokenizer, so I just replaced smaug with llama in the convert script: ...

In general you can use this feature to avoid having to hack the Python script: https://github.com/ggerganov/llama.cpp/pull/7959 - it allows you to specify the tokenizer.
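
For files that were already converted with the wrong value, the metadata can also be overridden at load time (from memory, so double-check the exact syntax):

./llama-cli -m model.gguf --override-kv tokenizer.ggml.pre=str:llama-bpe

though as noted above, the two pre-tokenizers end up behaving the same for this model anyway.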

steampunque commented 1 month ago

@steampunque can you by chance compare to https://huggingface.co/lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf to see if it's the same? I got the right answer on math and your gold coin question

WG with that quant is identical to mine: 934 333 1267 0 0 .737

ngxson commented 1 month ago

Interesting that we have an ipython role now along with system, user and assistant: https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/

Good to know. In fact, for this reason, I did not constrain the role to be an enum when the llama_chat_apply_template API was first introduced to llama.cpp. Using a role-based system for tool calls should be the way to do it... I wonder why ChatML didn't do this from the beginning.

MoonRide303 commented 1 month ago

Using the New UI in the llama.cpp server (prompt style: Llama 3, system prompt: none, temperature 0.6) solved the non-stopping generation issue for me - the model seems to be working fine now.

qnixsynapse commented 1 month ago

@ngxson Will this now be easier to implement? I looked at the code and saw everything in one place. I wonder if it is possible to refactor it to make heavy use of C++ classes.

bartowski1182 commented 1 month ago

@m18coppola do we need this change in transformers for convert time or run time?

At convert time, I think it's basically irrelevant whether it's recognized as smaug-bpe or llama-bpe... they both behave identically.

So it's good to update the tokenizer for proper recognition, but I don't think it affects the final output in any way.

ngxson commented 1 month ago

@qnixsynapse What kind of thing do you want to implement?

If you want function calling, then just use the appropriate role for your message. The C++ code only passes your message to the model. Nothing fancy here.

After all, there is not much you can do if the model is only trained to call tools in Python. Most models are trained that way - that's why I suggest just doing it in Python instead: https://github.com/ggerganov/llama.cpp/issues/8650#issuecomment-2245939845

qnixsynapse commented 1 month ago

@ngxson Yeah, makes sense. That's why I opened a FR in the llama-cpp-python repository.

ngxson commented 1 month ago

@qnixsynapse llama-cpp-python seems to support function calling out of the box: https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#function-calling

After all, function calling is just a more complicated prompt format, so it should be trivial to add. I'm not sure whether llama-cpp-python already supports the llama 3.1 tool-call format; you can probably have a look and add a PR if needed.
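
A minimal sketch of what that looks like with the generic chatml-function-calling handler (the get_weather tool and model path are purely illustrative; a Llama 3.1-specific handler would still need to be added, as noted below):

from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",  # illustrative path
    chat_format="chatml-function-calling",
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant with access to tools."},
        {"role": "user", "content": "What's the weather in Amsterdam?"},
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, for illustration only
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    tool_choice="auto",
)
print(response["choices"][0]["message"])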

qnixsynapse commented 1 month ago

@ngxson llama-cpp-python only supports functionary models and ChatML for function calling. For llama 3.1, the template needs to be added. Also, we will need a separate chat_handler like this.