ggerganov / llama.cpp

LLM inference in C/C++
MIT License

llama3 family support #6747

Closed gulldan closed 6 months ago

gulldan commented 6 months ago

Llama 3 has been released; it would be great to be able to use it with llama.cpp. https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6

https://github.com/meta-llama/llama3

maziyarpanahi commented 6 months ago

As far as I can see, it seems to be the same as before from an architecture point of view. There might be some extra things to optimize further: https://huggingface.co/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF

johnrachwan123 commented 6 months ago

@maziyarpanahi I am getting a "tokenizer.model" not found error. How did you resolve this?

MoonRide303 commented 6 months ago

Calling convert-hf-to-gguf.py ends up with FileNotFoundError: Cannot find Llama BPE tokenizer. Trying to use tokenizer.model from original folder in https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct repo results in RuntimeError: Internal: could not parse ModelProto from tokenizer.model.

maziyarpanahi commented 6 months ago

I have the latest, pulled and built a few hours ago. I am getting worried now with all these failed conversions! I tested the quants, though, and they work.

brandon-e2e commented 6 months ago

@maziyarpanahi Have you confirmed that instruct mode works? That's where I'm seeing issues (possibly user error).

Edit: Nevermind, figured out the chat template.

thecivilizedgamer commented 6 months ago

I tried using the quantized instruct model from https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q6_K.gguf, but when I use it (specifying the llama2 chat template) I get odd results, which look like a chat-template issue:

Hi! How can I assist you today? [/INST] <>

Wait, no! You're the one who is supposed to respond. I made a mistake! [/INST] <> You are the one who is supposed to respond now. [/INST] <>
I think there was a miscommunication! Can you please clarify what you need help with? [/INST] <> [/INST] I need to create a new Google account and set up my computer. Also, I've got some questions about the Google Home app. [/INST]

MoonRide303 commented 6 months ago

@thecivilizedgamer: chat template changed, look at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json#L2053.

thecivilizedgamer commented 6 months ago

@thecivilizedgamer: chat template changed, look at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json#L2053.

Ooooh thank you, I was looking but didn't see any info about that

thecivilizedgamer commented 6 months ago

@MoonRide303 sorry to bug you, but do you know how to specify the new template format? I assume that eventually it will be added to llama.cpp as one of the predefined templates along with llama2 and chatml, but I'm not sure how to specify it in the meantime.

jxy commented 6 months ago

In their code, the chat format is here: https://github.com/meta-llama/llama3/blob/299bfd8212fec65698c2f8c7b5970cbbb74c2a4f/llama/tokenizer.py#L202
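For reference, rendering a conversation with that ChatFormat produces a prompt with header tokens around the role, a double newline, the message content, then <|eot_id|>, and an open assistant header at the end so the model writes the reply. Below is only an illustrative sketch of that format, not code from meta-llama/llama3 or llama.cpp:

// Illustrative only: assembles a Llama 3 Instruct prompt the way the linked
// ChatFormat describes. Not code from meta-llama/llama3 or llama.cpp.
#include <iostream>
#include <string>
#include <utility>
#include <vector>

static std::string llama3_prompt(const std::vector<std::pair<std::string, std::string>> & messages) {
    std::string out = "<|begin_of_text|>";
    for (const auto & m : messages) {
        // <|start_header_id|>role<|end_header_id|>\n\n + content + <|eot_id|>
        out += "<|start_header_id|>" + m.first + "<|end_header_id|>\n\n" + m.second + "<|eot_id|>";
    }
    // Leave the assistant header open so the model generates the next message.
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n";
    return out;
}

int main() {
    std::cout << llama3_prompt({
        {"system", "You are a helpful assistant."},
        {"user",   "Hi!"},
    });
}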

arch-btw commented 6 months ago

python convert.py Meta-Llama-3-8B-Instruct --outtype f16 --vocab-type bpe

You need to add --vocab-type bpe, as mentioned here: https://github.com/ggerganov/llama.cpp/pull/6745#issuecomment-2064814034

ddh0 commented 6 months ago

Just pulled latest from master. When trying to convert from HF/safetensors to GGUF using convert-hf-to-gguf.py I get:

Loading model: Meta-Llama-3-8B-Instruct
gguf: This GGUF file is for Little Endian only
Set model parameters
gguf: context length = 8192
gguf: embedding length = 4096
gguf: feed forward length = 14336
gguf: head count = 32
gguf: key-value head count = 8
gguf: rope theta = 500000.0
gguf: rms norm epsilon = 1e-05
gguf: file type = 1
Set model tokenizer
Traceback (most recent call last):
  File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 1302, in set_vocab
    self._set_vocab_sentencepiece()
  File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 330, in _set_vocab_sentencepiece
    raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: Meta-Llama-3-8B-Instruct/tokenizer.model

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 2728, in <module>
    main()
  File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 2715, in main
    model_instance.set_vocab()
  File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 1304, in set_vocab
    self._set_vocab_llama_hf()
  File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
    vocab = LlamaHfVocab(self.dir_model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dylan/Documents/AI/llama.cpp/convert.py", line 532, in __init__
    raise FileNotFoundError('Cannot find Llama BPE tokenizer')
FileNotFoundError: Cannot find Llama BPE tokenizer

When trying to convert from HF/safetensors to GGUF using convert.py I get:

Loading model file Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors
Loading model file Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors
Loading model file Meta-Llama-3-8B-Instruct/model-00002-of-00004.safetensors
Loading model file Meta-Llama-3-8B-Instruct/model-00003-of-00004.safetensors
Loading model file Meta-Llama-3-8B-Instruct/model-00004-of-00004.safetensors
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('Meta-Llama-3-8B-Instruct'))
Traceback (most recent call last):
  File "/home/dylan/Documents/AI/./llama.cpp/convert.py", line 1548, in <module>
    main()
  File "/home/dylan/Documents/AI/./llama.cpp/convert.py", line 1515, in main
    vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dylan/Documents/AI/./llama.cpp/convert.py", line 1417, in load_vocab
    vocab = self._create_vocab_by_path(vocab_types)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dylan/Documents/AI/./llama.cpp/convert.py", line 1407, in _create_vocab_by_path
    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']

Hopefully this can be useful as a reference. Thanks!

Jipok commented 6 months ago

@thecivilizedgamer ./main -m ~/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color -n -2 -e -s 0 -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -ngl 99 --mirostat 2 -c 8192 -r '<|eot_id|>' --in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -i

thecivilizedgamer commented 6 months ago

Thanks @Jipok! I neglected to mention that I'm using llama.cpp in server mode. Do you know if there is a way to manually specify the chat format in server mode?

The bottom paragraph of https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template makes me think maybe that's not possible. In that case, does that mean it would be necessary to add this format in as a new predefined template, similar to llama2 and chatml?

phymbert commented 6 months ago

Do you know if there is a way to manually specify the chat format in server mode?

No, but you can use the infill endpoint as shown above in main.

DifferentialityDevelopment commented 6 months ago

Thanks @Jipok! I neglected to mention that I'm using llama.cpp in server mode. Do you know if there is a way to manually specify the chat format in server mode?

The bottom paragraph of https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template makes me think maybe that's not possible. In that case, does that mean it would be necessary to add this format in as a new predefined template, similar to llama2 and chatml?

I feel you, man. I'm also using server mode and want to integrate Llama 3 with my application, so I want to get the new template up and running. From what it looks like, you can implement your own chat template and then just rebuild llama.cpp; going to try that now.

DifferentialityDevelopment commented 6 months ago

This is my rudimentary attempt at quickly trying to add llama-3 chat template support, if it works alright I'll try and make a pull request for it

[image: proposed llama-3 chat template code]

avioligo commented 6 months ago

Is there any reason not to add a chat template command line argument to ./server? It can be a string like here: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/c9231f629c54de150fe4cca99a98034f32fb589e/tokenizer_config.json#L2053

DifferentialityDevelopment commented 6 months ago

So I did a zig build with CUDA support (first time ever using zig, and wow, it's amazing). My code changes seem to work really well and the model is responding coherently. Can anyone give me a brief primer on how I'd actually submit the changes as a pull request?

mirek190 commented 6 months ago

The template for Llama 3 is:

--in-prefix " <|start_header_id|>user<|end_header_id|> " --in-suffix " <|eot_id|><|start_header_id|>assistant<|end_header_id|> " -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability. <|eot_id|> "

You also have to add --reverse-prompt "assistant", as it otherwise spams badly ;)

Jipok commented 6 months ago

@mirek190 This is different from what is said in the meta-llama/llama3:

The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in ChatFormat needs to be followed: The prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role system, user or assistant, and the <|end_header_id|> tag. After a double newline \n\n the contents of the message follow. The end of each message is marked by the <|eot_id|> token.

Look at my version above, it seems more correct to me.

mirek190 commented 6 months ago

@mirek190 This is different from what is said in the meta-llama/llama3:

The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in ChatFormat needs to be followed: The prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role system, user or assistant, and the <|end_header_id|> tag. After a double newline \n\n the contents of the message follow. The end of each message is marked by the <|eot_id|> token.

Look at my version above, it seems more correct to me.

According to

https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

My template is correct.

I ran some tests... this 8B model is insane, almost at the level of WizardLM-2 8x22B, which has 140B parameters... wtf!

llama.cpp:

main.exe --model models/new3/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color --threads 30 --keep -1 --batch-size 512 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io --in-prefix " <|start_header_id|>user<|end_header_id|> " --in-suffix " <|eot_id|><|start_header_id|>assistant<|end_header_id|> " -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability. <|eot_id|> " --reverse-prompt "assistant"

Seriously ! wtf

Today I own 3 cars but last year I sold 2 cars. How many cars do I own today?
 <|eot_id|><|start_header_id|>assistant<|end_header_id|> Let's break it down!
You currently own 3 cars.
Last year, you sold 2 cars.

To find the number of cars you owned before selling any, add the current number to the number of cars sold:
3 (current) + 2 (sold) = 5 cars

Since you've already sold those 2 cars, subtract them from the total:
5 - 2 = 3 cars

You still own the same 3 cars that you currently own.

It even answered this question almost properly! ...it only forgot about 1 gold coin...

I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?
 <|eot_id|><|start_header_id|>assistant<|end_header_id|> Let's break down the changes you've experienced step by step!

1. You start with 10 apples.
2. You find 3 gold coins in the river, which doesn't affect your apple count.
3. You lose 4 apples, leaving you with 6 apples (10 - 4 = 6).
4. The birds drop 6 apples each, so that's a total of 18 new apples (6 x 3 = 18). Adding these to your remaining apples gives you a total of 24 apples (6 + 18 = 24).
5. You win 6 gold coins in the game but have to share them equally with your teammates. That means you get 2 gold coins (6 ÷ 3 = 2).
6. You spend all the gold coins you have (including the original 3 and the new 2) on apples, since each apple costs 0.5 coins.

Let's calculate how many apples you can buy with your total gold coins:

You have a total of 5 gold coins (3 + 2). Since each apple costs 0.5 coins, you can buy:

5 coins ÷ 0.5 coins/apple = 10 apples

Adding these to the apples you already had, you now have a total of:

24 apples + 10 apples = 34 apples

As for where the river is located, it runs near a big city that allows you to spend the gold coins on something (apples). Since you found the gold coins in the river and then went to play an online game with your teammates, it's likely that the city is nearby or easily accessible.

You now have 34 apples!

Look at this! Insane! Only GPT-4 and Opus can answer it!

Create 10 sentences that ends with a word "apple". Remember the word "apple" MUST be at the end.
 <|eot_id|><|start_header_id|>assistant<|end_header_id|> Here are 10 sentences that end with the word "apple":

1. The farmer picked a juicy red apple.
2. She ate a crunchy snack of peanut butter and apple.
3. The teacher asked her student to bring a small green apple.
4. I love biting into a fresh, crisp autumn apple.
5. Grandma always kept a sweet, ripe Granny Smith apple.
6. The health-conscious person chose a healthy Gala apple.
7. Dad planted a row of seedless Red Delicious apple.
8. In the still life painting, there was a beautiful Golden apple.
9. Mom packed a lunchbox with a juicy Fuji apple and cheese.
10. At the farmer's market, I bought a big, ripe Braeburn apple.

Each sentence ends with the word "apple", as requested!

or this ...

What are the top 10 most famous non-human animal statues?
 <|eot_id|><|start_header_id|>assistant<|end_header_id|> What a unique question!

Here's a list of the top 10 most famous non-human animal statues, in no particular order:

1. **The Lion at the Louvre** (Paris, France): A majestic bronze lion statue guarding the entrance to the world-famous Louvre Museum.
2. **The Charging Bull** (New York City, USA): A iconic bronze sculpture of a bull, symbolizing strength and resilience, located in Lower Manhattan's Financial District.
3. **The Horses of Saint-Mark's Basilica** (Venice, Italy): Four magnificent bronze horses adorning the facade of Saint-Mark's Basilica, a masterpiece by Italian sculptor Barletta.
4. **The Lion of Belfort** (Belfort, France): A massive 15-meter-tall stone lion statue guarding the citadel in eastern France.
5. **The Elephant Gates** (Paris, France): Two colossal bronze elephants flanking the entrance to the Musée d'Orsay, a famous art museum.
6. **The Bronze Horseman** (St. Petersburg, Russia): A stunning equestrian statue of Peter the Great, located in Senate Square, symbolizing Russian history and culture.
7. **The Fountain of the Four Rivers** (Rome, Italy): A majestic bronze statue depicting four rivers (Nile, Ganges, Danube, and Rio Plata) being personified as women, located in the Piazza Navona.
8. **The Lion of Glasgow** (Glasgow, Scotland): A grand stone lion statue standing guard outside the City Chambers building in Glasgow's George Square.
9. **The Horse Guards** (London, UK): Two imposing bronze horse statues situated at the entrance to Buckingham Palace, symbolizing British military prowess.
10. **The Sea Lions of the Golden Gate Bridge** (San Francisco, USA): Two iconic bronze sea lion statues guarding the Golden Gate Bridge, a popular tourist attraction.

insane ...

If I lay 2 wet shirts out in the sun flat on the ground to dry and it takes 5 hours until they are dry, how long would it take to dry 4 shirts of the same type that way simultanously?
 <|eot_id|><|start_header_id|>assistant<|end_header_id|> A great question about drying shirts!

The time it takes for the shirts to dry is influenced by factors like air circulation, temperature, and humidity. Since you're laying them flat on the ground in the sun, we can assume that these conditions are similar for all four shirts.

Since it took 5 hours to dry 2 shirts simultaneously, it's likely that the drying process is not affected by the number of shirts being dried. This means that the same amount of time would be required to dry 4 shirts as well!

So, if it takes 5 hours to dry 2 shirts, it would still take approximately 5 hours to dry all 4 shirts simultaneously
There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?
 <|eot_id|><|start_header_id|>assistant<|end_header_id|> The classic lateral thinking puzzle!

Let's break it down: "two ducks in front of a duck" means that one duck is standing in front of another.

"Two ducks behind a duck" implies that this same duck is being followed by two more ducks.

And, of course, there's the "duck in the middle".

So, if we combine all these clues, we can conclude that there are... 3 ducks!

THAT Llama 3 8B is INSANE. Better than anything I've seen, even better than WizardLM-2 8x22B! (the apple-sentence question, etc.) That is a huge leap into another dimension for small-model performance... I just cannot believe such a small model (8B!) can be so smart and have such high reasoning capability.
I am afraid of testing the 70B version ;)

EliEron commented 6 months ago

@mirek190 This is different from what is said in the meta-llama/llama3:

The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in ChatFormat needs to be followed: The prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role system, user or assistant, and the <|end_header_id|> tag. After a double newline \n\n the contents of the message follow. The end of each message is marked by the <|eot_id|> token.

Look at my version above, it seems more correct to me.

According to

https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

My template is correct.

Your template does not include the double newlines between the header tokens and the message, which are required according to the page you link. That's the main difference between your template and Jipok's.

ddh0 commented 6 months ago

The template for Llama 3 is:

--in-prefix " <|start_header_id|>user<|end_header_id|> " --in-suffix " <|eot_id|><|start_header_id|>assistant<|end_header_id|> " -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability. <|eot_id|> "

You also have to add --reverse-prompt "assistant", as it otherwise spams badly ;)

This can't be right, because now the model is not allowed to say the word "assistant".....

mirek190 commented 6 months ago

So how should it look for llama.cpp? Because it works fine for me.

https://github.com/ggerganov/llama.cpp/issues/6747#issuecomment-2065328875

PS: you are right, this one is better. I changed it a bit and it works perfectly now!

main.exe --model models/new3/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color --threads 30 --keep -1 --batch-size 512 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io -r '<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n"

Create 10 sentences that ends with a word "apple". Remember the word "apple" MUST be at the end.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHere are ten sentences that end with the word "apple":

1. I love to eat a juicy red apple.
2. The farmer carefully picked the ripe green apple.
3. She carefully polished her shiny silver apple.
4. He tossed the football high into the crisp autumn air apple.
5. My friend from France brought me a delicious French apple.
6. You can't go wrong with a classic Granny Smith apple.
7. I'm craving a crunchy sweet tart juicy red apple.
8. The little girl held her favorite stuffed cuddly soft teddy bear apple.
9. We had a picnic under the shade of a tall old oak tree apple.
10. The new employee brought in fresh healthy snacks like crunchy granola bars and a crisp Granny Smith apple.

### Output:

I hope this meets your request!

That 8B Llama 3 is insane! Everything before Llama 3 is trash now...

SamuelTallet commented 6 months ago

This is my rudimentary attempt at quickly trying to add llama-3 chat template support, if it works alright I'll try and make a pull request for it

[image: proposed llama-3 chat template code]

@DifferentialityDevelopment According to Meta specifications:

After a double newline \n\n the contents of the message follow.

You are missing the \n\n between "<|end_header_id|>" and << trim(message->content)

SamuelTallet commented 6 months ago

anyone that can give me a brief primer on how I'd actually submit the changes as a pull request?

@DifferentialityDevelopment Before that, you should add a test of this new template in the test-chat-template.cpp file. There are details and a Python helper script in this Wiki page. By the way, you should add this new template in the Wiki.

To submit the changes: fork llama.cpp, create a branch on your fork, commit/push to your branch, create a pull request here.

DifferentialityDevelopment commented 6 months ago

This is my rudimentary attempt at quickly trying to add llama-3 chat template support, if it works alright I'll try and make a pull request for it [image: proposed llama-3 chat template code]

@DifferentialityDevelopment According to Meta specifications:

After a double newline \n\n the contents of the message follow.

You are missing the \n\n between "<|end_header_id|>" and << trim(message->content)

I've added the missing newlines after <|end_header_id|>, thanks for spotting this. I also added the two required lines in test-chat-template.cpp.

Pull request is here: https://github.com/ggerganov/llama.cpp/pull/6751
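For anyone integrating this from an application rather than hand-building the prompt string as in the main commands above: once a template like this is recognized, you can go through llama.cpp's public llama_chat_apply_template() API. The sketch below is illustrative only and assumes the llama.h signature of this period (model pointer, optional template override, message array, add_ass flag, output buffer); it is not code from the pull request.

// Illustrative sketch: format a conversation through llama.cpp's public
// llama_chat_apply_template() API, assuming the llama.h signature of this
// period. `model` is a llama_model * that has been loaded elsewhere.
#include <cstdio>
#include <string>
#include <vector>
#include "llama.h"

std::string render_chat(const llama_model * model, const std::vector<llama_chat_message> & msgs) {
    std::vector<char> buf(8192);
    // nullptr template override = use the model's own tokenizer.chat_template
    // metadata; add_ass = true appends the open assistant header at the end.
    int32_t n = llama_chat_apply_template(model, nullptr, msgs.data(), msgs.size(),
                                          /*add_ass=*/true, buf.data(), (int32_t) buf.size());
    if (n < 0) {
        fprintf(stderr, "chat template not recognized\n");  // assumed failure convention
        return "";
    }
    if (n > (int32_t) buf.size()) {
        // Buffer was too small: the return value is the required size, so retry.
        buf.resize(n);
        n = llama_chat_apply_template(model, nullptr, msgs.data(), msgs.size(),
                                      true, buf.data(), (int32_t) buf.size());
    }
    return std::string(buf.data(), n);
}

The returned string would then be tokenized with special-token parsing enabled and fed to the model as usual.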

clearsitedesigns commented 6 months ago

Has anyone solved this issue yet? It looks like some people are past the conversion step. I assume I'm missing a BPE tokenizer, but where do you get it?

FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

Any-Winter-4079 commented 6 months ago

Sorry to steer the topic away from chat templates momentarily, but can anyone share the command to convert the .pth file to e.g. bin? I downloaded through the Meta site instead of Hugging Face, so I have run: python convert.py --outfile models/llama-3-8b/ggml-model-f16.bin --outtype f16 ../llama3/Meta-Llama-3-8B --vocab-type bpe

getting:

    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

And without --vocab-type:

RuntimeError: Internal: /Users/runner/work/sentencepiece/sentencepiece/src/sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

On ../llama3/Meta-Llama-3-8B I have (besides the checklist):

consolidated.00.pth
params.json
tokenizer.model

I assume the vocab is tokenizer.model using BPE, but I don't know how to pass it. Running llama.cpp % python convert.py --outfile models/llama-3-8b/ggml-model-f16.bin --outtype f16 ../llama3/Meta-Llama-3-8B --vocab-dir ../llama3/Meta-Llama-3-8B --vocab-type bpe (which should be the obvious fix) again results in

    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

Any help is appreciated!

DifferentialityDevelopment commented 6 months ago

Sorry to steer the topic away from chat templates momentarily, but can anyone share the command to convert the .pth file to e.g. bin? I downloaded through the Meta site instead of Hugging Face, so I have run: python convert.py --outfile models/llama-3-8b/ggml-model-f16.bin --outtype f16 ../llama3/Meta-Llama-3-8B --vocab-type bpe

getting:

    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

And without --vocab-type:

RuntimeError: Internal: /Users/runner/work/sentencepiece/sentencepiece/src/sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

On ../llama3/Meta-Llama-3-8B I have (besides the checklist):

consolidated.00.pth
params.json
tokenizer.model

I assume the vocab is tokenizer.model using BPE, but I don't know how to pass it. Running llama.cpp % python convert.py --outfile models/llama-3-8b/ggml-model-f16.bin --outtype f16 ../llama3/Meta-Llama-3-8B --vocab-dir ../llama3/Meta-Llama-3-8B --vocab-type bpe (which should be the obvious fix) again results in

    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

Any help is appreciated!

This pull request might be of some help https://github.com/ggerganov/llama.cpp/pull/6745

clearsitedesigns commented 6 months ago

Looks like there are still some issues with the latest ones out on Hugging Face and now Ollama, with a few conversions popping up. Ollama output:

Thought: Since our previous attempt didn! The issue wasn! Will that do the trick?! Please re- Evaluate!<|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|>I think we're almost there!

DifferentialityDevelopment commented 6 months ago

If anyone else is having the issue with the model not stopping correctly, here is a fixed version of the 8B Instruct GGUF quant https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF

dillonroach commented 6 months ago

The issue had really just been a misconfiguration of the EOS token - folks have started to fix it and release proper quants (e.g. https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/tree/main), so it seems like all the 'support llama3' issues could be closed.

ddh0 commented 6 months ago

The issue had really just been a misconfiguration of the EOS token - folks have started to fix it and release proper quants (e.g. https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/tree/main), so it seems like all the 'support llama3' issues could be closed.

IMO, ideally llama.cpp would support using both tokens as EOS tokens, because that's officially the right way to do it.

DifferentialityDevelopment commented 6 months ago

The issue had really just been a misconfiguration of the EOS token - folks have started to fix it and release proper quants (e.g. https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/tree/main), so it seems like all the 'support llama3' issues could be closed.

IMO, ideally llama.cpp would support using both tokens as EOS tokens, because that's officially the right way to do it.

One of the changes in my pull request does just that.
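Until something like that is merged, a client-side workaround is to treat both ids as terminators in your own sampling loop. The sketch below is only an illustration of that idea, assuming a llama_model * loaded elsewhere and the llama_token_eos()/llama_tokenize() calls from llama.h of this period; it is not the pull request's code.

// Illustrative helper: collect both stop ids for Llama 3 Instruct, i.e. the
// model's configured EOS token plus <|eot_id|>. Assumes llama.h of this period.
#include <cstring>
#include <vector>
#include "llama.h"

std::vector<llama_token> llama3_stop_ids(const llama_model * model) {
    std::vector<llama_token> stops;
    stops.push_back(llama_token_eos(model));  // typically <|end_of_text|> in these GGUFs

    // Tokenize "<|eot_id|>" with special-token parsing enabled; it should map
    // to a single control token id.
    std::vector<llama_token> tmp(8);
    const int32_t n = llama_tokenize(model, "<|eot_id|>", (int32_t) strlen("<|eot_id|>"),
                                     tmp.data(), (int32_t) tmp.size(),
                                     /*add_special=*/false, /*parse_special=*/true);
    if (n == 1) {
        stops.push_back(tmp[0]);
    }
    return stops;
}

A generation loop would then break as soon as the sampled token id matches any entry in that list, regardless of which of the two end tokens the model emits.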

MoonRide303 commented 6 months ago

Updated GGUFs from https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF don't have the infinite-generation problem - I tested Q6_K, and it works fine with just this Modelfile (I am testing converted GGUF models with Ollama):

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ .Response }}"""
SYSTEM """You are a helpful assistant."""
PARAMETER num_ctx 8192
PARAMETER num_gpu 99

hyperbolic-c commented 6 months ago

If anyone else is having the issue with the model not stopping correctly, here is a fixed version of the 8B Instruct GGUF quant https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF

Thanks for your work!! But I got a strange output error here; do you have any advice?

DifferentialityDevelopment commented 6 months ago

If anyone else is having the issue with the model not stopping correctly, here is a fixed version of the 8B Instruct GGUF quant https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF

Thanks for your work!! But I got a strange output error here; do you have any advice?

That quant works perfectly for me; make sure you're also using the latest version of llama.cpp.

My pull request to support the chat template for Llama 3 will hopefully be merged soon (fingers crossed). It also adds the stop string Llama 3 uses to utils.hpp, but I think those changes only affect llama.cpp server mode.

Maybe give some more clarification on how you're running the inference: is it via llama.cpp main directly, via LM Studio, or something else?

hyperbolic-c commented 6 months ago

@DifferentialityDevelopment Thanks for your reply. I run the inference via llama.cpp main directly like

 ./main -m ../llama3/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

Looking forward to your pull!!

DifferentialityDevelopment commented 6 months ago

@DifferentialityDevelopment Thanks for your reply. I run the inference via llama.cpp main directly like

 ./main -m ../llama3/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

Looking forward to your pull!!

Your context length is way too small; set -n 8192 and it will stop giving gibberish output.

hyperbolic-c commented 6 months ago

@DifferentialityDevelopment Thanks for your reply. I run the inference via llama.cpp main directly like

 ./main -m ../llama3/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

Looking forward to your pull!!

Your context length is way too small; set -n 8192 and it will stop giving gibberish output.

Thanks a lot!!! I am new to LLMs; it works after setting it to 8192.

hyperbolic-c commented 6 months ago

@DifferentialityDevelopment I am sorry to bother you. I got the same error after setting -n 8192. After two consecutive rounds of dialogue, the problem of repeating output reappears. Could it have something to do with my context length setting (-c N), or is there another reason?

Propheticus commented 6 months ago

@hyperbolic-c -n is the max number of output tokens for the generated reply. Context length is indeed set via -c https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#common-options

DifferentialityDevelopment commented 6 months ago

@hyperbolic-c -n is the max number of output tokens for the generated reply. Context length is indeed set via -c https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#common-options

Sorry my mistake :)

ryao commented 6 months ago

This command avoids problems with the end of stream token:

./main -m /home/richard/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3-8B-Instruct-GGUF/snapshots/fc6d3af3b92e96d64053358e1dd549b312fee1f1/Meta-Llama-3-8B-Instruct.fp16.gguf --no-display-prompt -e -c 0 -r '<|eot_id|>' --in-prefix '\n<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a story writing assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nWrite a story about llamas.<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n"

It is based on the recommendation from here:

https://huggingface.co/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF/discussions/7

Propheticus commented 6 months ago

Is that different than what Jipok said above?

hyperbolic-c commented 6 months ago

@hyperbolic-c -n is the max number of output tokens for the generated reply. Context length is indeed set via -c https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#common-options

Thanks, I knew it. I have also read the README to understand it.

hyperbolic-c commented 6 months ago

@hyperbolic-c -n is the max number of output tokens for the generated reply. Context length is indeed set via -c https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#common-options

Sorry my mistake :)

Oh no no no. Thanks for your help! ! !

ryao commented 6 months ago

Is that different than what Jipok said above?

It is not. There was so much discussion since the last time I looked at this thread yesterday that I missed that.