Closed gulldan closed 6 months ago
As far as I can see, it seems to be the same as before from the architecture point of view. There might be some extra stuff to optimize more https://huggingface.co/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF
@maziyarpanahi I am getting a "tokenizer.model" not found error. How did you resolve this?
Calling convert-hf-to-gguf.py ends up with FileNotFoundError: Cannot find Llama BPE tokenizer. Trying to use tokenizer.model from the original folder in the https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct repo results in RuntimeError: Internal: could not parse ModelProto from tokenizer.model.
I have the latest pulled and built from a few hours ago. I am getting worried now with all these failed converts! I tested the quants, they work though.
@maziyarpanahi Have you confirmed that instruct mode works? That's where I'm seeing issues (possibly user error).
Edit: Never mind, I figured out the chat template.
I tried using the quantized instruct model from https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.Q6_K.gguf, but when I try using it (specifying llama2 chat template) I get odd results, which seem like an issue with the chat template:
Hi! How can I assist you today? [/INST] <>
Wait, no! You're the one who is supposed to respond. I made a mistake! [/INST] <> You are the one who is supposed to respond now. [/INST] <>
I think there was a miscommunication! Can you please clarify what you need help with? [/INST] <> [/INST] I need to create a new Google account and set up my computer. Also, I've got some questions about the Google Home app. [/INST]
@thecivilizedgamer: chat template changed, look at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json#L2053.
Ooooh thank you, I was looking but didn't see any info about that
@MoonRide303 sorry to bug you, but do you know how to specify the new template format? I assume that eventually it will be added to llama.cpp as one of the predefined templates along with llama2 and chatml, but I'm not sure how to specify it in the meantime.
In their code, the chat format is here: https://github.com/meta-llama/llama3/blob/299bfd8212fec65698c2f8c7b5970cbbb74c2a4f/llama/tokenizer.py#L202
python convert.py Meta-Llama-3-8B-Instruct --outtype f16 --vocab-type bpe
You need to add --vocab-type bpe, as mentioned here: https://github.com/ggerganov/llama.cpp/pull/6745#issuecomment-2064814034
Just pulled latest from master. When trying to convert from HF/safetensors to GGUF using convert-hf-to-gguf.py, I get:
Loading model: Meta-Llama-3-8B-Instruct
gguf: This GGUF file is for Little Endian only
Set model parameters
gguf: context length = 8192
gguf: embedding length = 4096
gguf: feed forward length = 14336
gguf: head count = 32
gguf: key-value head count = 8
gguf: rope theta = 500000.0
gguf: rms norm epsilon = 1e-05
gguf: file type = 1
Set model tokenizer
Traceback (most recent call last):
File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 1302, in set_vocab
self._set_vocab_sentencepiece()
File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 330, in _set_vocab_sentencepiece
raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: Meta-Llama-3-8B-Instruct/tokenizer.model
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 2728, in <module>
main()
File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 2715, in main
model_instance.set_vocab()
File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 1304, in set_vocab
self._set_vocab_llama_hf()
File "/home/dylan/Documents/AI/./llama.cpp/convert-hf-to-gguf.py", line 377, in _set_vocab_llama_hf
vocab = LlamaHfVocab(self.dir_model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dylan/Documents/AI/llama.cpp/convert.py", line 532, in __init__
raise FileNotFoundError('Cannot find Llama BPE tokenizer')
FileNotFoundError: Cannot find Llama BPE tokenizer
When trying to convert from HF/safetensors to GGUF using convert.py, I get:
Loading model file Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors
Loading model file Meta-Llama-3-8B-Instruct/model-00001-of-00004.safetensors
Loading model file Meta-Llama-3-8B-Instruct/model-00002-of-00004.safetensors
Loading model file Meta-Llama-3-8B-Instruct/model-00003-of-00004.safetensors
Loading model file Meta-Llama-3-8B-Instruct/model-00004-of-00004.safetensors
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('Meta-Llama-3-8B-Instruct'))
Traceback (most recent call last):
File "/home/dylan/Documents/AI/./llama.cpp/convert.py", line 1548, in <module>
main()
File "/home/dylan/Documents/AI/./llama.cpp/convert.py", line 1515, in main
vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dylan/Documents/AI/./llama.cpp/convert.py", line 1417, in load_vocab
vocab = self._create_vocab_by_path(vocab_types)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/dylan/Documents/AI/./llama.cpp/convert.py", line 1407, in _create_vocab_by_path
raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['spm', 'hfft']
Hopefully this can be useful as a reference. Thanks!
@thecivilizedgamer
./main -m ~/models/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color -n -2 -e -s 0 -p '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -ngl 99 --mirostat 2 -c 8192 -r '<|eot_id|>' --in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -i
Thanks @Jipok! I neglected to mention that I'm using llama.cpp in server mode. Do you know if there is a way to manually specify the chat format in server mode?
The bottom paragraph of https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template makes me think maybe that's not possible. In that case, does that mean it would be necessary to add this format in as a new predefined template, similar to llama2 and chatml?
Do you know if there is a way to manually specify the chat format in server mode?
No, but you can use the infill endpoint as shown above in main.
I feel you, man. I'm also using server mode and want to integrate Llama 3 with my application, so I want to get the new template up and running. From what it looks like, you can implement your own chat template and then just rebuild llama.cpp. Going to try that now.
This is my rudimentary attempt at quickly trying to add llama-3 chat template support, if it works alright I'll try and make a pull request for it
Is there any reason not to add a chat template command line argument to ./server? It can be a string like here: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/c9231f629c54de150fe4cca99a98034f32fb589e/tokenizer_config.json#L2053
So I did a zig build with CUDA support (first time ever using zig, and wow, it's amazing). My code changes seem to work really well, and the model is responding coherently. Can anyone give me a brief primer on how I'd actually submit the changes as a pull request?
template for llama 3 is
--in-prefix " <|start_header_id|>user<|end_header_id|> " --in-suffix " <|eot_id|><|start_header_id|>assistant<|end_header_id|> " -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability. <|eot_id|> "
Also, you have to add --reverse-prompt "assistant" as well, as it is spamming badly ;)
@mirek190 This is different from what is said in the meta-llama/llama3:
The fine-tuned models were trained for dialogue applications. To get the expected features and performance for them, a specific formatting defined in ChatFormat needs to be followed: The prompt begins with a <|begin_of_text|> special token, after which one or more messages follow. Each message starts with the <|start_header_id|> tag, the role system, user or assistant, and the <|end_header_id|> tag. After a double newline \n\n the contents of the message follow. The end of each message is marked by the <|eot_id|> token.
Look at my version above, it seems more correct to me.
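For anyone building the prompt in their own code, here is a minimal Python sketch of the format described in that quote. The function name format_llama3_prompt and the add_generation_prompt parameter are my own (hypothetical), not taken from Meta's code; the token strings are the ones quoted above. Stripping whitespace from the content is an assumption on my part.

```python
# Special tokens as quoted from the meta-llama/llama3 README above.
BEGIN_OF_TEXT = "<|begin_of_text|>"
START_HEADER = "<|start_header_id|>"
END_HEADER = "<|end_header_id|>"
EOT = "<|eot_id|>"


def format_llama3_prompt(messages, add_generation_prompt=True):
    """Build a Llama 3 prompt from a list of {"role", "content"} dicts.

    Hypothetical helper for illustration only.
    """
    parts = [BEGIN_OF_TEXT]
    for message in messages:
        role = message["role"]
        if role not in ("system", "user", "assistant"):
            raise ValueError(f"unexpected role: {role!r}")
        # Header, double newline, content, end-of-turn token.
        content = message["content"].strip()  # assumption: trim whitespace
        parts.append(f"{START_HEADER}{role}{END_HEADER}\n\n{content}{EOT}")
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply.
        parts.append(f"{START_HEADER}assistant{END_HEADER}\n\n")
    return "".join(parts)


prompt = format_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
])
print(repr(prompt))
```

With these two messages the result is the same string as the -p argument in my ./main command above, once -e has expanded the \n escapes.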
According to https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/ my template is correct.
I ran a test... This 8B model is insane... almost at the level of Wizard 2 8x22B, which has 140B parameters... wtf!
llamacpp
main.exe --model models/new3/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color --threads 30 --keep -1 --batch-size 512 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io --in-prefix " <|start_header_id|>user<|end_header_id|> " --in-suffix " <|eot_id|><|start_header_id|>assistant<|end_header_id|> " -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability. <|eot_id|> " --reverse-prompt "assistant"
Seriously ! wtf
Today I own 3 cars but last year I sold 2 cars. How many cars do I own today?
<|eot_id|><|start_header_id|>assistant<|end_header_id|> Let's break it down! You currently own 3 cars.
Last year, you sold 2 cars.
To find the number of cars you owned before selling any, add the current number to the number of cars sold:
3 (current) + 2 (sold) = 5 cars
Since you've already sold those 2 cars, subtract them from the total:
5 - 2 = 3 cars
You still own the same 3 cars that you currently own.
It even answered this question almost correctly! ... It forgot about 1 gold coin...
I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?
<|eot_id|><|start_header_id|>assistant<|end_header_id|> Let's break down the changes you've experienced step by step!
1. You start with 10 apples.
2. You find 3 gold coins in the river, which doesn't affect your apple count.
3. You lose 4 apples, leaving you with 6 apples (10 - 4 = 6).
4. The birds drop 6 apples each, so that's a total of 18 new apples (6 x 3 = 18). Adding these to your remaining apples gives you a total of 24 apples (6 + 18 = 24).
5. You win 6 gold coins in the game but have to share them equally with your teammates. That means you get 2 gold coins (6 ÷ 3 = 2).
6. You spend all the gold coins you have (including the original 3 and the new 2) on apples, since each apple costs 0.5 coins.
Let's calculate how many apples you can buy with your total gold coins:
You have a total of 5 gold coins (3 + 2). Since each apple costs 0.5 coins, you can buy:
5 coins ÷ 0.5 coins/apple = 10 apples
Adding these to the apples you already had, you now have a total of:
24 apples + 10 apples = 34 apples
As for where the river is located, it runs near a big city that allows you to spend the gold coins on something (apples). Since you found the gold coins in the river and then went to play an online game with your teammates, it's likely that the city is nearby or easily accessible.
You now have 34 apples!
Look at this! Insane! Only GPT-4 and Opus can answer it!
Create 10 sentences that ends with a word "apple". Remember the word "apple" MUST be at the end.
<|eot_id|><|start_header_id|>assistant<|end_header_id|> Here are 10 sentences that end with the word "apple":
1. The farmer picked a juicy red apple.
2. She ate a crunchy snack of peanut butter and apple.
3. The teacher asked her student to bring a small green apple.
4. I love biting into a fresh, crisp autumn apple.
5. Grandma always kept a sweet, ripe Granny Smith apple.
6. The health-conscious person chose a healthy Gala apple.
7. Dad planted a row of seedless Red Delicious apple.
8. In the still life painting, there was a beautiful Golden apple.
9. Mom packed a lunchbox with a juicy Fuji apple and cheese.
10. At the farmer's market, I bought a big, ripe Braeburn apple.
Each sentence ends with the word "apple", as requested!
or this ...
What are the top 10 most famous non-human animal statues?
<|eot_id|><|start_header_id|>assistant<|end_header_id|> What a unique question!
Here's a list of the top 10 most famous non-human animal statues, in no particular order:
1. **The Lion at the Louvre** (Paris, France): A majestic bronze lion statue guarding the entrance to the world-famous Louvre Museum.
2. **The Charging Bull** (New York City, USA): A iconic bronze sculpture of a bull, symbolizing strength and resilience, located in Lower Manhattan's Financial District.
3. **The Horses of Saint-Mark's Basilica** (Venice, Italy): Four magnificent bronze horses adorning the facade of Saint-Mark's Basilica, a masterpiece by Italian sculptor Barletta.
4. **The Lion of Belfort** (Belfort, France): A massive 15-meter-tall stone lion statue guarding the citadel in eastern France.
5. **The Elephant Gates** (Paris, France): Two colossal bronze elephants flanking the entrance to the Musée d'Orsay, a famous art museum.
6. **The Bronze Horseman** (St. Petersburg, Russia): A stunning equestrian statue of Peter the Great, located in Senate Square, symbolizing Russian history and culture.
7. **The Fountain of the Four Rivers** (Rome, Italy): A majestic bronze statue depicting four rivers (Nile, Ganges, Danube, and Rio Plata) being personified as women, located in the Piazza Navona.
8. **The Lion of Glasgow** (Glasgow, Scotland): A grand stone lion statue standing guard outside the City Chambers building in Glasgow's George Square.
9. **The Horse Guards** (London, UK): Two imposing bronze horse statues situated at the entrance to Buckingham Palace, symbolizing British military prowess.
10. **The Sea Lions of the Golden Gate Bridge** (San Francisco, USA): Two iconic bronze sea lion statues guarding the Golden Gate Bridge, a popular tourist attraction.
insane ...
If I lay 2 wet shirts out in the sun flat on the ground to dry and it takes 5 hours until they are dry, how long would it take to dry 4 shirts of the same type that way simultanously?
<|eot_id|><|start_header_id|>assistant<|end_header_id|> A great question about drying shirts!
The time it takes for the shirts to dry is influenced by factors like air circulation, temperature, and humidity. Since you're laying them flat on the ground in the sun, we can assume that these conditions are similar for all four shirts.
Since it took 5 hours to dry 2 shirts simultaneously, it's likely that the drying process is not affected by the number of shirts being dried. This means that the same amount of time would be required to dry 4 shirts as well!
So, if it takes 5 hours to dry 2 shirts, it would still take approximately 5 hours to dry all 4 shirts simultaneously
There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?
<|eot_id|><|start_header_id|>assistant<|end_header_id|> The classic lateral thinking puzzle!
Let's break it down: "two ducks in front of a duck" means that one duck is standing in front of another.
"Two ducks behind a duck" implies that this same duck is being followed by two more ducks.
And, of course, there's the "duck in the middle".
So, if we combine all these clues, we can conclude that there are... 3 ducks!
THAT Llama 3 8B is INSANE.
Better than anything I have seen, even better than Wizard 2 8x22B! (the apple sentence question, etc.)
That is a huge leap into another dimension for small-model performance... I just cannot believe such a small model (8B!) can be so smart and have such high reasoning capability.
I am afraid of testing the 70B version ;)
According to https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/ my template is correct.
Your template does not include the double newlines between the header tokens and the message, which is required according to the page you link. That's the main difference between your and Jipok's template.
template for llama 3 is
--in-prefix " <|start_header_id|>user<|end_header_id|> " --in-suffix " <|eot_id|><|start_header_id|>assistant<|end_header_id|> " -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability. <|eot_id|> "
Also, you have to add --reverse-prompt "assistant" as well, as it is spamming badly ;)
This can't be right, because now the model is not allowed to say the word "assistant".....
So how should it look for llama.cpp? Because it works fine for me.
https://github.com/ggerganov/llama.cpp/issues/6747#issuecomment-2065328875
P.S. You are right, this one is better. I changed it a bit and it works perfectly now!
main.exe --model models/new3/Meta-Llama-3-8B-Instruct.Q8_0.gguf --color --threads 30 --keep -1 --batch-size 512 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io -r '<|eot_id|>' --in-prefix "\n<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nHi!<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n"
Create 10 sentences that ends with a word "apple". Remember the word "apple" MUST be at the end.
<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHere are ten sentences that end with the word "apple":
1. I love to eat a juicy red apple.
2. The farmer carefully picked the ripe green apple.
3. She carefully polished her shiny silver apple.
4. He tossed the football high into the crisp autumn air apple.
5. My friend from France brought me a delicious French apple.
6. You can't go wrong with a classic Granny Smith apple.
7. I'm craving a crunchy sweet tart juicy red apple.
8. The little girl held her favorite stuffed cuddly soft teddy bear apple.
9. We had a picnic under the shade of a tall old oak tree apple.
10. The new employee brought in fresh healthy snacks like crunchy granola bars and a crisp Granny Smith apple.
### Output:
I hope this meets your request!
That 8B Llama 3 is insane! Everything before Llama 3 is trash now...
This is my rudimentary attempt at quickly trying to add llama-3 chat template support, if it works alright I'll try and make a pull request for it
@DifferentialityDevelopment According to Meta specifications:
After a double newline \n\n the contents of the message follow.
You are missing the \n\n between "<|end_header_id|>" and << trim(message->content)
anyone that can give me a brief primer on how I'd actually submit the changes as a pull request?
@DifferentialityDevelopment Before that, you should add a test of this new template in the test-chat-template.cpp file. There are details and a Python helper script in this Wiki page. By the way, you should add this new template in the Wiki.
To submit the changes: fork llama.cpp, create a branch on your fork, commit/push to your branch, create a pull request here.
I've added the missing newlines after end_header_id, thanks for spotting this. I also added the two lines required in test-chat-template.cpp.
Pull request is here: https://github.com/ggerganov/llama.cpp/pull/6751
Has anyone solved this issue yet? It looks like some are past the conversion. I assume I'm missing a BPE tokenizer, but where do you get it?
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']
Sorry to steer the topic away from chat templates momentarily, but can anyone share the command to convert the .pth file to e.g. bin?
I downloaded through the Meta site instead of Hugging Face, so I have run:
python convert.py --outfile models/llama-3-8b/ggml-model-f16.bin --outtype f16 ../llama3/Meta-Llama-3-8B --vocab-type bpe
getting:
raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']
And without --vocab-type:
RuntimeError: Internal: /Users/runner/work/sentencepiece/sentencepiece/src/sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
On ../llama3/Meta-Llama-3-8B I have (besides the checklist):
consolidated.00.pth
params.json
tokenizer.model
I assume the vocab is tokenizer.model using bpe?, but I don't know how to pass it.
llama.cpp % python convert.py --outfile models/llama-3-8b/ggml-model-f16.bin --outtype f16 ../llama3/Meta-Llama-3-8B --vocab-dir ../llama3/Meta-Llama-3-8B --vocab-type bpe
which, as should be obvious, results again in
raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']
Any help is appreciated!
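A note on why SentencePiece chokes on that file: as far as I can tell from Meta's tokenizer.py (linked earlier in this thread), the tokenizer.model shipped with Llama 3 is a tiktoken-style BPE file, i.e. plain text with one "base64 token, integer rank" pair per line, not a SentencePiece protobuf. Below is a rough Python heuristic to check which kind of file you have. The function name looks_like_tiktoken_bpe is hypothetical, and this only diagnoses the file; it does not make convert.py accept it.

```python
import base64
from pathlib import Path


def looks_like_tiktoken_bpe(path, max_lines=100):
    """Heuristic check for a tiktoken-style BPE file.

    Returns True if the first non-empty lines all look like
    '<base64 token> <integer rank>'. Hypothetical helper, not part
    of llama.cpp or Meta's code.
    """
    try:
        text = Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        # Binary content, e.g. a SentencePiece protobuf.
        return False
    lines = [line for line in text.splitlines() if line.strip()][:max_lines]
    if not lines:
        return False
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            return False
        token_b64, rank = fields
        try:
            base64.b64decode(token_b64, validate=True)
            int(rank)
        except ValueError:  # binascii.Error is a ValueError subclass
            return False
    return True
```

Calling looks_like_tiktoken_bpe("../llama3/Meta-Llama-3-8B/tokenizer.model") and getting True would explain the ParseFromArray error above: the file is simply not in the format SentencePiece expects.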
This pull request might be of some help https://github.com/ggerganov/llama.cpp/pull/6745
Looks like there are still some issues with the latest ones out on Hugging Face and now Ollama, with a few conversions popping up. Ollama output:
Thought: Since our previous attempt didn! The issue wasn! Will that do the trick?! Please re- Evaluate!<|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|>I think we're almost there!
If anyone else is having the issue with the model not stopping correctly, here is a fixed version of the 8B Instruct GGUF quant https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF
The issue was really just the configuration of the EOS token. Folks have started to fix it and release proper quants (e.g. https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/tree/main). It seems like all the 'support llama3' issues could be closed.
IMO, ideally llama.cpp would support using both tokens as EOS tokens, because that's officially the right way to do it.
The one change I did in my pull request does just that
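Until a build with that change is available, a client can get a similar effect by cutting the generated text at whichever end token shows up first. A minimal Python sketch, assuming the two tokens in question are <|eot_id|> and <|end_of_text|> (the second name comes from the Llama 3 tokenizer config, not from this thread). truncate_at_stop is a hypothetical helper that post-processes strings; it is not how llama.cpp handles EOS internally, which works on token IDs.

```python
# Assumption: these are the two end tokens Llama 3 Instruct can emit.
STOP_STRINGS = ("<|eot_id|>", "<|end_of_text|>")


def truncate_at_stop(text, stop_strings=STOP_STRINGS):
    """Cut text at the earliest stop string.

    Returns (truncated_text, found), where found says whether any
    stop string was present. Hypothetical helper for illustration.
    """
    positions = [text.find(stop) for stop in stop_strings]
    positions = [pos for pos in positions if pos != -1]
    if not positions:
        return text, False
    return text[:min(positions)], True


reply, stopped = truncate_at_stop("I think we're almost there!<|eot_id|><|eot_id|>")
print(reply, stopped)
```

In a streaming client you would run this on the accumulated output after each chunk and stop requesting tokens as soon as found is True.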
Updated GGUFs from https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF don't have infinite generation problem - I tested Q6_K, and it works fine with just this Modelfile (I am testing converted GGUF models with Ollama):
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{{ .Response }}"""
SYSTEM """You are a helpful assistant."""
PARAMETER num_ctx 8192
PARAMETER num_gpu 99
If anyone else is having the issue with the model not stopping correctly, here is a fixed version of the 8B Instruct GGUF quant https://huggingface.co/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF
Thanks for your work! But I got a strange output error here. Do you have any advice?
That quant works perfectly for me. Make sure you are also using the latest version of llama.cpp.
My pull request to support the chat template for Llama 3 will hopefully be merged soon (fingers crossed). It also adds the stop string Llama 3 uses to utils.hpp, but I think these changes only affect llama.cpp server mode.
Maybe give some more clarification on how you're running the inference: is it via llama.cpp main directly, via LM Studio, or something else?
@DifferentialityDevelopment Thanks for your reply. I run the inference via llama.cpp main directly like
./main -m ../llama3/models/Meta-Llama-3-8B-Instruct-Q8_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
Looking forward to your pull!!
Your context length is way too small. Set -n 8192 and it will stop giving gibberish output.
Thanks a lot! I am new to LLMs. It works when set to 8192.
@DifferentialityDevelopment I am sorry to bother you. I got the same error after setting -n 8192. After two consecutive rounds of dialogue, the problem of repeated output reappears. Could it have something to do with my context length setting with -c N, or is there another reason?
@hyperbolic-c -n is the max number of output tokens for the generated reply. Context length is indeed set via -c https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#common-options
Sorry my mistake :)
This command avoids problems with the end of stream token:
./main -m /home/richard/.cache/huggingface/hub/models--MaziyarPanahi--Meta-Llama-3-8B-Instruct-GGUF/snapshots/fc6d3af3b92e96d64053358e1dd549b312fee1f1/Meta-Llama-3-8B-Instruct.fp16.gguf --no-display-prompt -e -c 0 -r '<|eot_id|>' --in-prefix '\n<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a story writing assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nWrite a story about llamas.<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n"
It is based on the recommendation from here:
https://huggingface.co/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF/discussions/7
Is that different than what Jipok said above?
@hyperbolic-c -n is the max number of output tokens for the generated reply. Context length is indeed set via -c https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#common-options
Thanks, I knew it. I have also read the README to understand it.
Sorry my mistake :)
Oh no no no. Thanks for your help! ! !
Is that different than what Jipok said above?
It is not. There was so much discussion since the last time I looked at this thread yesterday that I missed that.
Llama 3 has been released; I would be happy to use it with llama.cpp: https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6
https://github.com/meta-llama/llama3