ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Add llama 2 model #2262

Closed tikikun closed 12 months ago

tikikun commented 1 year ago

Meta just released the Llama 2 model, allowing commercial usage.

https://ai.meta.com/resources/models-and-libraries/llama/

I have checked the model implementation and it seems different from LLaMA v1; it may need a re-implementation.

Green-Sky commented 1 year ago

link to paper: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

Azeirah commented 1 year ago

Interesting to note that the model evaluation section in their paper lists a 34b model even though the site doesn't talk about it. I wonder if it'll be available.

Does anyone have access to the models yet? I signed up but haven't received an e-mail. It's not super clear to me if it's meant to be instant or not.

Green-Sky commented 1 year ago

Interestingly, the paper talks about a 34B model, which is missing from the model card. edit: @Azeirah was faster lol

slaren commented 1 year ago

The paper implies that they are planning to release the 34B model later.

Green-Sky commented 1 year ago

@Azeirah no, I have not heard back yet either.

Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.

Keep in mind that the links expire after 24 hours and a certain amount of downloads. If you start seeing errors such as 403: Forbidden, you can always re-request a link.

Also, they are available on HF if you registered with the same email: https://huggingface.co/meta-llama

Azeirah commented 1 year ago

I was really hopeful for an alternative to gpt-4 for coding assistance, but the evaluation states their 70B model is about equivalent in performance to gpt-3.5.

Not bad, but the jump in quality from 3.5 to 4 is what made it really useful in day-to-day coding tasks. ;(

At the very least, it does look like the 7B and 13B variants will be amazing local chatbots for low perf devices.

dmadisetti commented 1 year ago

I just got access, but the download is flaky: checksums are not matching and the auth is hit or miss. Notable are the chat-specific models:

https://github.com/facebookresearch/llama/blob/main/download.sh#L24C1-L43C7

Will update if I am actually able to download these weights

goranmoomin commented 1 year ago

The updated model code for Llama 2 is at the same facebookresearch/llama repo, diff here: https://github.com/facebookresearch/llama/commit/6d4c0c290aeec1fa4399694fefb864be5a153bb6

Code-wise, the only difference seems to be the addition of GQA on the larger models, i.e. the repeat_kv part that repeats the same k/v attention heads so the k/v cache requires less memory.

According to the paper, the smaller models (i.e. the 7B/13B ones) don't use GQA, so in theory they should run unmodified.
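
For reference, the repeat_kv helper in that diff works roughly like the following PyTorch sketch (paraphrased from memory, so check the linked commit for the exact code): each of the n_kv_heads key/value heads is expanded n_rep times so it lines up with the query heads.

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat each k/v head n_rep times so the k/v heads line up with the
    query heads (equivalent to torch.repeat_interleave(x, n_rep, dim=2))."""
    bs, slen, n_kv_heads, head_dim = x.shape  # (batch, seq, kv heads, head dim)
    if n_rep == 1:
        return x  # 7B/13B case: n_kv_heads == n_heads, nothing to do
    return (
        x[:, :, :, None, :]
        .expand(bs, slen, n_kv_heads, n_rep, head_dim)
        .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
    )
```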

dmadisetti commented 1 year ago

Email below with tracking links stripped. Same as LLaMA 1 for the most part. Now if only it would actually download...


You’re all set to start building with Llama 2.

The models listed below are now available to you as a commercial license holder. By downloading a model, you are agreeing to the terms and conditions of the license, acceptable use policy and Meta’s privacy policy.

Model weights available:

Llama-2-7b
Llama-2-7b-chat
Llama-2-13b
Llama-2-13b-chat
Llama-2-70b
Llama-2-70b-chat

With each model download, you’ll receive a copy of the Llama 2 Community License and Acceptable Use Policy, and can find all other information on the model and code on GitHub.

How to download the models:

Visit GitHub and clone [the Llama repository](https://github.com/facebookresearch/llama) from there in order to download the model code
Run the download.sh script and follow the prompts for downloading the models.
When asked for your unique custom URL, please insert the following:
<redacted for legal reasons>
Select which model weights to download

The unique custom URL provided will remain valid for model downloads for 24 hours, and requests can be submitted multiple times. Now you’re ready to start building with Llama 2.

Helpful tips: Please read the instructions in the GitHub repo and use the provided code examples to understand how to best interact with the models. In particular, for the fine-tuned chat models you must use appropriate formatting and correct system/instruction tokens to get the best results from the model.

You can find additional information about how to responsibly deploy Llama models in our Responsible Use Guide.

If you need to report issues: If you or any Llama 2 user becomes aware of any violation of our license or acceptable use policies - or any bug or issues with Llama 2 that could lead to any such violations - please report it through one of the following means:

Reporting issues with the model: Llama GitHub
Giving feedback about potentially problematic output generated by the model: [Llama output feedback](https://developers.facebook.com/llama_output_feedback)
Reporting bugs and security concerns: [Bug Bounty Program](https://facebook.com/whitehat/info)
Reporting violations of the Acceptable Use Policy: [LlamaUseReport@meta.com](mailto:LlamaUseReport@meta.com)

Subscribe to get the latest updates on Llama and Meta AI.

Meta’s GenAI Team

swyxio commented 1 year ago

anyone else also randomly getting

Resolving download.llamameta.net (download.llamameta.net)... 13.33.88.72, 13.33.88.62, 13.33.88.45, ...
Connecting to download.llamameta.net (download.llamameta.net)|13.33.88.72|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-07-19 01:24:43 ERROR 403: Forbidden.

for the small files? but /llama-2-7b-chat/consolidated.00.pth is downloading fine it seems. will share checksums when i have them

BetaDoggo commented 1 year ago

I tried the 7B and it seems to be working fine, with CUDA acceleration as well.

Azeirah commented 1 year ago

anyone else also randomly getting

Resolving download.llamameta.net (download.llamameta.net)... 13.33.88.72, 13.33.88.62, 13.33.88.45, ...
Connecting to download.llamameta.net (download.llamameta.net)|13.33.88.72|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-07-19 01:24:43 ERROR 403: Forbidden.

for the small files? but /llama-2-7b-chat/consolidated.00.pth is downloading fine it seems. will share checksums when i have them

I genuinely just think their servers are a bit overloaded given what I see posted here. It's a big release

trrahul commented 1 year ago

Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML

Azeirah commented 1 year ago

Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML

Thebloke is a wizard O_O

Johnhersh commented 1 year ago

Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML

These worked as-is for me

LoganDark commented 1 year ago

Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML

Holy heck what is this dude's upload speed? I'm watching https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main fill in live; they're uploading gigabytes of model per minute!

Azeirah commented 1 year ago

Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML

Holy heck what is this dude's upload speed? I'm watching https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main fill in live, they uploading gigabytes of model per minute!

Wouldn't be surprised if he's uploading from a service like AWS or Azure, those have insane bandwidth available.

Johnhersh commented 1 year ago

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.

LoganDark commented 1 year ago

Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML

Holy heck what is this dude's upload speed? I'm watching https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main fill in live, they uploading gigabytes of model per minute!

Wouldn't be surprised if he's uploading from a service like AWS or Azure, those have insane bandwidth available.

As in, renting a VPS or dedicated server just to quantize + upload? (actually, come to think of it, that is an official recommendation by huggingface, wouldn't be surprised...)

LoganDark commented 1 year ago

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

Azeirah commented 1 year ago

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

Depends on whether you're using the quantised or non-quantised version as well; neither of you posted which model you're using, so comparing doesn't make sense :p

Johnhersh commented 1 year ago

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

Depends on if you're using the quantised or non-quantised version as well, neither of you two posted which model you're using so comparing doesn't make sense :p

Quantized. I'm using llama-2-13b.ggmlv3.q4_1.bin

LoganDark commented 1 year ago

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

Depends on if you're using the quantised or non-quantised version as well, neither of you two posted which model you're using so comparing doesn't make sense :p

Quantized. I'm using llama-2-13b.ggmlv3.q4_1.bin

q4_0 should be even faster for only slightly less accuracy

Green-Sky commented 1 year ago

IIRC q4_1 has an outdated perf/size tradeoff; use one of the k-quants instead (or q4_0).

nullhook commented 1 year ago

inferencing with q4_1 on M1 Max (64GB)

2.99 ms per token is slow

LoganDark commented 1 year ago

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

huh nevermind

(llama-2-13b-chat.ggmlv3.q4_0 with all layers offloaded)

Johnhersh commented 1 year ago

huh nevermind

(llama-2-13b-chat.ggmlv3.q4_0 with all layers offloaded)

How do you offload the layers?

SlyEcho commented 1 year ago

What is the prompting format for the chat model?

Green-Sky commented 1 year ago

What is the prompting format for the chat model?

I did not test it, but see https://github.com/facebookresearch/llama/blob/cfc3fc8c1968d390eb830e65c63865e980873a06/llama/generation.py#L44-L49
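
For reference, the tags defined around those lines are approximately the following (reproduced from memory, so double-check against the link):

```python
# Tags used by the Llama 2 chat fine-tunes, as defined around the linked
# lines of generation.py (from memory -- verify against the source).
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
```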

asgeir commented 1 year ago

What is the prompting format for the chat model?

Looks like it's

[INST] <<SYS>>
Always answer only with emojis
<</SYS>>

How to go from Beijing to NY? [/INST]

TheBloke commented 1 year ago

Wouldn't be surprised if he's uploading from a service like AWS or Azure, those have insane bandwidth available.

As in, renting a VPS or dedicated server just to quantize + upload? (actually, come to think of it, that is an official recommendation by huggingface, wouldn't be surprised...)

Currently using 3 x LambdaLabs H100s

What is the prompting format for the chat model?

Looks like it's

[INST] <<SYS>>
Always answer only with emojis
<</SYS>>

How to go from Beijing to NY? [/INST]

I have put this in my READMEs, based on reading generation.py:

SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
USER: {prompt}
ASSISTANT:

SlyEcho commented 1 year ago

It seems like it also needs BOS and EOS tokens for every message pair

Green-Sky commented 1 year ago

It seems like it also needs BOS and EOS tokens for every message pair

Yes, it does: https://github.com/facebookresearch/llama/blob/cfc3fc8c1968d390eb830e65c63865e980873a06/llama/generation.py#L248-L251

SlyEcho commented 1 year ago

It works with 4096 tokens out of the box.

netrunnereve commented 1 year ago

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

huh nevermind

(llama-2-13b-chat.ggmlv3.q4_0 with all layers offloaded)

Metal inference is fast, but cuBLAS/CLBlast absolutely smokes Apple on the prompt-processing side. With 4K context that number matters much more now.

kacchan-001 commented 1 year ago

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

huh nevermind

(llama-2-13b-chat.ggmlv3.q4_0 with all layers offloaded)

Can you give me your main command? I couldn't get 13b running on a 3060

wizzard0 commented 1 year ago

70B requires patches for GQA btw (as seen in TheBlokeAI discord) https://github.com/facebookresearch/llama/commit/6d4c0c290aeec1fa4399694fefb864be5a153bb6

SlyEcho commented 1 year ago

Currently works in web chat with these settings:

prompt:

You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

User name: [INS] Bot name: [/INS]

Template:

<<SYS>>
{{prompt}}
<</SYS>>

{{history}} [/INS]

History template:

 {{name}} {{message}}

Screenshot: ![image](https://github.com/ggerganov/llama.cpp/assets/795193/703148fc-08bb-4834-8490-e77c33fa5d29)

However, it will reevaluate the last response all the time no matter how much I massage the whitespace, which I think is because it is adding the BOS tokens into the response and there is no way to model that right now in the API.

jxy commented 1 year ago

So the chat model uses something like

{BOS}[INST] <<SYS>>
{system}
<</SYS>>

{instruct-0} [/INST] {response-0} {EOS}{BOS}[INST] {instruct-1} [/INST] {response-1} {EOS}{BOS}[INST] {instruct-N} [/INST]

The model generates EOS automatically, but there's no way to insert BOS with the current code in this repo, neither in main nor in server.
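
If that's right, assembling the prompt in Python would look roughly like the sketch below (a hedged sketch: the exact whitespace around [INST]/[/INST] should be verified against generation.py, and in the reference implementation BOS/EOS are token ids emitted by the tokenizer rather than literal strings):

```python
def build_llama2_chat_prompt(system, turns, bos="<s>", eos="</s>"):
    """Assemble a Llama 2 chat prompt from (user, assistant) pairs; the last
    assistant entry may be None for the turn that is still being generated.

    Note: the bos/eos string markers here are placeholders standing in for
    the tokenizer's BOS/EOS tokens.
    """
    prompt = ""
    for i, (user, assistant) in enumerate(turns):
        if i == 0 and system:
            # The system prompt is folded into the first user message.
            user = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
        prompt += f"{bos}[INST] {user} [/INST]"
        if assistant is not None:
            # Every completed exchange is closed with EOS; the next turn
            # then starts with a fresh BOS.
            prompt += f" {assistant} {eos}"
    return prompt

# One finished exchange plus a new question awaiting a reply:
print(build_llama2_chat_prompt(
    "Always answer only with emojis",
    [("How to go from Beijing to NY?", "✈️"), ("And back?", None)],
))
```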

oobabooga commented 1 year ago

Quantizing 70B works and generates a GGML, but loading the model fails with this error

error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024

SlyEcho commented 1 year ago

EOS doesn't matter that much since it's at the end. main.cpp could handle it because it doesn't retokenize the history all the time. But currently the formatting is hardcoded for Alpaca-style models.

19h commented 1 year ago

@oobabooga I was able to make it work by

1. commenting out the if (lt.ne != ne) { block from llama.cpp
2. removing the GGML_ASSERT(ggml_nelements(a) == ne0);, GGML_ASSERT(ggml_nelements(a) == ne0*ne1);, GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2);, and GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2*ne3); asserts from ggml.c.

19h commented 1 year ago

Ok nvm, llama is producing garbage with the quantised 70b model..

TheBloke commented 1 year ago

@19h what about at 2048 context?

19h commented 1 year ago

Pretty sure I fucked it up by ignoring the shape discrepancy.. garbage output always happening at all context sizes.

Pretty sure this is due to the Grouped-Query Attention (GQA) that's used with the 70B model as per the paper [0].

[0] https://scontent-fra3-1.xx.fbcdn.net/v/t39.2365-6/10000000_663429262362723_1696968207443577320_n.pdf?_nc_cat=101&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=5ol-jUSglG4AX9ujLdk&_nc_ht=scontent-fra3-1.xx&oh=00_AfC_KCiPbWk1TklOxA_A7j2jVxSbux7fjeoalGZZQ0D4VA&oe=64BBB691
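
The shape in the error above is consistent with that: if the 70B model uses 64 query heads but only 8 k/v heads with a head dimension of 128 (my reading of the config, so treat these numbers as assumptions), then wk projects 8192 down to 8 × 128 = 1024 rather than to 8192, which the current loader does not expect. A quick sanity check:

```python
# Assumed Llama-2-70B attention dimensions (my reading of the config --
# treat these numbers as assumptions): hidden size 8192, 64 query heads,
# 8 k/v heads shared via GQA.
hidden_size = 8192
n_heads = 64
n_kv_heads = 8
head_dim = hidden_size // n_heads                # 128

wq_shape = (hidden_size, n_heads * head_dim)     # (8192, 8192)
wk_shape = (hidden_size, n_kv_heads * head_dim)  # (8192, 1024) -> matches the error
print(wq_shape, wk_shape)
```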

TheBloke commented 1 year ago

OK thanks for the update. Yeah it was a bit surprising that just removing the checks would get it working, but you never know until you try!

19h commented 1 year ago

@ggerganov once you have time to look into this: the Llama 2 paper references this paper, which elaborates the GQA idea: https://arxiv.org/pdf/2305.13245.pdf

oobabooga commented 1 year ago

So who will be the hero to implement GQA and send a PR? 😬

19h commented 1 year ago

Trying to figure out the GQA implementation details .. https://github.com/facebookresearch/llama/issues/384.

bullno1 commented 1 year ago

So the chat model uses something like

{BOS}[INST] <<SYS>>
{system}
<</SYS>>

{instruct-0} [/INST] {response-0} {EOS}{BOS}[INST] {instruct-1} [/INST] {response-1} {EOS}{BOS}[INST] {instruct-N} [/INST]

The model generate EOS automatically, but there's no way to insert BOS with the current code in this repo, neither in main nor in server.

You mean from the CLI right? Using the library, BOS is available as llama_token_bos().

It might be possible to cook up something just for this.

Special syntax to express BOS/EOS from the CLI might be a bit of a pain.