tikikun closed this issue 12 months ago
Interesting to note that the model evaluation section in their paper lists a 34b model even though the site doesn't talk about it. I wonder if it'll be available.
Does anyone have access to the models yet? I signed up but haven't received an e-mail. It's not super clear to me if it's meant to be instant or not.
Interestingly, the paper talks about a 34B model, which is missing from the model card. edit: @Azeirah was faster lol
The paper implies that they are planning to release the 34B model later.
@Azeirah no, i did not hear back yet either.
Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.
Keep in mind that the links expire after 24 hours and a certain amount of downloads. If you start seeing errors such as 403: Forbidden, you can always re-request a link.
also, they are available on hf if your email is the same https://huggingface.co/meta-llama
I was really hopeful for an alternative to GPT-4 for coding assistance, but the evaluation states their 70B model is about equivalent in performance to GPT-3.5.
Not bad, but the jump in quality from 3.5 to 4 is what made it really useful in day-to-day coding tasks. ;(
At the very least, it does look like the 7B and 13B variants will be amazing local chatbots for low perf devices.
I just got access, but the download is flaky, checksums are not matching, and the auth is hit or miss. Notable are the chat-specific models:
https://github.com/facebookresearch/llama/blob/main/download.sh#L24C1-L43C7
Will update if I am actually able to download these weights
The updated model code for Llama 2 is at the same facebookresearch/llama repo, diff here: https://github.com/facebookresearch/llama/commit/6d4c0c290aeec1fa4399694fefb864be5a153bb6
Code-wise, the only difference seems to be the addition of GQA on the large models, i.e. the `repeat_kv` part that repeats the same k/v attention heads on larger models to require less memory for the k/v cache.
According to the paper, the smaller models (i.e. the 7B/13B ones) don't have GQA, so in theory they should be able to run unmodified.
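For reference, a minimal sketch of what that `repeat_kv` step does, assuming a `(seq_len, n_kv_heads, head_dim)` layout (the actual tensor layout and code in the Meta repo differ):

```python
import numpy as np

def repeat_kv(x: np.ndarray, n_rep: int) -> np.ndarray:
    """Repeat each k/v head n_rep times so the k/v heads line up
    with the query heads (n_heads = n_kv_heads * n_rep)."""
    if n_rep == 1:  # no GQA (e.g. the 7B/13B models): nothing to do
        return x
    seq_len, n_kv_heads, head_dim = x.shape
    # Broadcast a new repeat axis, then fold it into the head axis.
    return np.broadcast_to(
        x[:, :, None, :], (seq_len, n_kv_heads, n_rep, head_dim)
    ).reshape(seq_len, n_kv_heads * n_rep, head_dim)
```

The point is that only the k/v cache stores `n_kv_heads` heads; the repetition happens at attention time, which is why GQA saves cache memory on the large models.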
Email below with tracking links stripped. Same as llama-1 for the most part. Now if it would actually download.....
You’re all set to start building with Llama 2.
The models listed below are now available to you as a commercial license holder. By downloading a model, you are agreeing to the terms and conditions of the license, acceptable use policy and Meta’s privacy policy.
Model weights available:
Llama-2-7b
Llama-2-7b-chat
Llama-2-13b
Llama-2-13b-chat
Llama-2-70b
Llama-2-70b-chat
With each model download, you’ll receive a copy of the Llama 2 Community License and Acceptable Use Policy, and can find all other information on the model and code on GitHub.
How to download the models:
Visit GitHub and clone [the Llama repository](https://github.com/facebookresearch/llama) from there in order to download the model code
Run the download.sh script and follow the prompts for downloading the models.
When asked for your unique custom URL, please insert the following:
<redacted for legal reasons>
Select which model weights to download
The unique custom URL provided will remain valid for model downloads for 24 hours, and requests can be submitted multiple times. Now you’re ready to start building with Llama 2.
Helpful tips: Please read the instructions in the GitHub repo and use the provided code examples to understand how to best interact with the models. In particular, for the fine-tuned chat models you must use appropriate formatting and correct system/instruction tokens to get the best results from the model.
You can find additional information about how to responsibly deploy Llama models in our Responsible Use Guide.
If you need to report issues: If you or any Llama 2 user becomes aware of any violation of our license or acceptable use policies - or any bug or issues with Llama 2 that could lead to any such violations - please report it through one of the following means:
Reporting issues with the model: Llama GitHub
Giving feedback about potentially problematic output generated by the model: [Llama output feedback](https://developers.facebook.com/llama_output_feedback)
Reporting bugs and security concerns: [Bug Bounty Program](https://facebook.com/whitehat/info)
Reporting violations of the Acceptable Use Policy: [LlamaUseReport@meta.com](mailto:LlamaUseReport@meta.com)
Subscribe to get the latest updates on Llama and Meta AI.
Meta’s GenAI Team
anyone else also randomly getting

```
Resolving download.llamameta.net (download.llamameta.net)... 13.33.88.72, 13.33.88.62, 13.33.88.45, ...
Connecting to download.llamameta.net (download.llamameta.net)|13.33.88.72|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-07-19 01:24:43 ERROR 403: Forbidden.
```

for the small files? but `/llama-2-7b-chat/consolidated.00.pth` is downloading fine it seems. will share checksums when i have them
I tried the 7B and it seems to be working fine, with cuda acceleration as well.
> anyone else also randomly getting `HTTP request sent, awaiting response... 403 Forbidden` for the small files? but `/llama-2-7b-chat/consolidated.00.pth` is downloading fine it seems. will share checksums when i have them
I genuinely just think their servers are a bit overloaded given what I see posted here. It's a big release
Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML
> Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML
Thebloke is a wizard O_O
> Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML
These worked as-is for me
> Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML
Holy heck, what is this dude's upload speed? I'm watching https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main fill in live, they're uploading gigabytes of model per minute!
> Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML
> Holy heck, what is this dude's upload speed? I'm watching https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main fill in live, they're uploading gigabytes of model per minute!
Wouldn't be surprised if he's uploading from a service like AWS or Azure, those have insane bandwidth available.
It works, but it is veeeeery slow in silicon macs.
Hmm really? On the 13B one I get crazy-good speed.
> Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML
> Holy heck, what is this dude's upload speed? I'm watching https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main fill in live, they're uploading gigabytes of model per minute!
> Wouldn't be surprised if he's uploading from a service like AWS or Azure, those have insane bandwidth available.
As in, renting a VPS or dedicated server just to quantize + upload? (actually, come to think of it, that is an official recommendation by huggingface, wouldn't be surprised...)
> It works, but it is veeeeery slow in silicon macs.
> Hmm really? On the 13B one I get crazy-good speed.
Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)
> It works, but it is veeeeery slow in silicon macs.
> Hmm really? On the 13B one I get crazy-good speed.
> Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)
Depends on if you're using the quantised or non-quantised version as well, neither of you two posted which model you're using so comparing doesn't make sense :p
> It works, but it is veeeeery slow in silicon macs.
> Hmm really? On the 13B one I get crazy-good speed.
> Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)
> Depends on if you're using the quantised or non-quantised version as well, neither of you two posted which model you're using so comparing doesn't make sense :p
Quantized. I'm using llama-2-13b.ggmlv3.q4_1.bin
> It works, but it is veeeeery slow in silicon macs.
> Hmm really? On the 13B one I get crazy-good speed.
> Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)
> Depends on if you're using the quantised or non-quantised version as well, neither of you two posted which model you're using so comparing doesn't make sense :p
> Quantized. I'm using `llama-2-13b.ggmlv3.q4_1.bin`
q4_0 should be even faster for only slightly less accuracy
iirc q4_1 has an outdated perf/size tradeoff, use one of the kquants instead. (or q4_0)
inferencing with q4_1 on M1 Max (64GB): `2.99 ms per token` is slow
> It works, but it is veeeeery slow in silicon macs.
> Hmm really? On the 13B one I get crazy-good speed.
> Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)
huh nevermind
(llama-2-13b-chat.ggmlv3.q4_0 with all layers offloaded)
> huh nevermind
> (llama-2-13b-chat.ggmlv3.q4_0 with all layers offloaded)
How do you offload the layers?
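Not answered in the thread, but for reference: in llama.cpp, offloading is controlled by the `-ngl` / `--n-gpu-layers` flag on a build with GPU support (cuBLAS/Metal). A sketch; the model path and layer count here are placeholders:

```shell
# Build with GPU support first, e.g. `make LLAMA_CUBLAS=1` on NVIDIA.
# -ngl sets how many layers to offload; a large value offloads everything.
./main -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin -ngl 43 -c 4096 \
       -p "Hello"
```

If VRAM is tight, lower `-ngl` until the model fits; the remaining layers run on the CPU.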
What is the prompting format for the chat model?
> What is the prompting format for the chat model?
did not test, but https://github.com/facebookresearch/llama/blob/cfc3fc8c1968d390eb830e65c63865e980873a06/llama/generation.py#L44-L49
> What is the prompting format for the chat model?
Looks like it's
```
[INST] <<SYS>>
Always answer only with emojis
<</SYS>>
How to go from Beijing to NY? [/INST]
```
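A minimal Python sketch of assembling that single-turn prompt string; the tag constants match those in generation.py, but the helper function itself is mine:

```python
# Tag strings from facebookresearch/llama generation.py
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def chat_prompt(system: str, user: str) -> str:
    """Wrap a system prompt and a single user message in Llama-2 chat tags."""
    return f"{B_INST} {B_SYS}{system}{E_SYS}{user} {E_INST}"

print(chat_prompt("Always answer only with emojis",
                  "How to go from Beijing to NY?"))
```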
> Wouldn't be surprised if he's uploading from a service like AWS or Azure, those have insane bandwidth available.
> As in, renting a VPS or dedicated server just to quantize + upload? (actually, come to think of it, that is an official recommendation by huggingface, wouldn't be surprised...)
Currently using 3 x LambdaLabs H100s
> What is the prompting format for the chat model?
> Looks like it's `[INST] <<SYS>> Always answer only with emojis <</SYS>> How to go from Beijing to NY? [/INST]`
I have put this in my READMEs, based on reading the generation.py:

```
SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
USER: {prompt}
ASSISTANT:
```
It seems like it also needs `BOS` and `EOS` tokens for every message pair
> It seems like it also needs `BOS` and `EOS` tokens for every message pair
yes, it does https://github.com/facebookresearch/llama/blob/cfc3fc8c1968d390eb830e65c63865e980873a06/llama/generation.py#L248-L251
It works with 4096 tokens out of the box.
> It works, but it is veeeeery slow in silicon macs.
> Hmm really? On the 13B one I get crazy-good speed.
> Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)
> huh nevermind
> (llama-2-13b-chat.ggmlv3.q4_0 with all layers offloaded)
Metal inference is fast but cublas/clblast absolutely smokes Apple on the prompt processing side. With 4K context that number matters much more now.
> It works, but it is veeeeery slow in silicon macs.
> Hmm really? On the 13B one I get crazy-good speed.
> Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)
> huh nevermind
> (llama-2-13b-chat.ggmlv3.q4_0 with all layers offloaded)
Can you give me your main command? I couldn't get 13b running on a 3060
70B requires patches for GQA btw (as seen in TheBlokeAI discord) https://github.com/facebookresearch/llama/commit/6d4c0c290aeec1fa4399694fefb864be5a153bb6
Currently works in web chat with these settings:

Prompt:

```
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
```

User name: `[INS]`
Bot name: `[/INS]`

Template:

```
<<SYS>>
{{prompt}}
<</SYS>>
{{history}} [/INS]
```

History template:

```
{{name}} {{message}}
```
However, it will reevaluate the last response all the time no matter how much I massage the whitespace, which I think is because it is adding the `BOS` tokens into the response and there is no way to model that right now in the API.
So the chat model uses something like

```
{BOS}[INST] <<SYS>>
{system}
<</SYS>>
{instruct-0} [/INST] {response-0} {EOS}{BOS}[INST] {instruct-1} [/INST] {response-1} {EOS}{BOS}[INST] {instruct-N} [/INST]
```
The model generates `EOS` automatically, but there's no way to insert `BOS` with the current code in this repo, neither in `main` nor in `server`.
Quantizing 70B works and generates a GGML, but loading the model fails with this error:

```
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024
```
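That 8192 x 1024 shape is exactly what GQA predicts: the 70B model has 64 query heads of dimension 128 but only 8 k/v heads, so `wk`/`wv` project down to 8 * 128 = 1024. A quick sanity check (the hyperparameter values are assumed from the paper, not from this thread):

```python
# Llama-2-70B attention hyperparameters (assumed from the paper)
n_heads, n_kv_heads, head_dim = 64, 8, 128

d_model = n_heads * head_dim     # query projection width: 64 * 128 = 8192
kv_dim = n_kv_heads * head_dim   # k/v projection width under GQA: 8 * 128 = 1024

# wk maps d_model -> kv_dim, hence "expected 8192 x 8192, got 8192 x 1024"
print((d_model, kv_dim))
```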
EOS doesn't matter that much since it's in the end. main.cpp could handle it because it doesn't retokenize the history all the time. But currently the formatting is hardcoded for Alpaca-style models.
@oobabooga I was able to make it work by removing the `if (lt.ne != ne) {` block from llama.cpp and the `GGML_ASSERT(ggml_nelements(a) == ne0);`, `GGML_ASSERT(ggml_nelements(a) == ne0*ne1);`, `GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2);`, and `GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2*ne3);` asserts from ggml.c.

Ok nvm, llama is producing garbage with the quantised 70b model..
@19h what about at 2048 context?
Pretty sure I fucked it up by ignoring the shape discrepancy.. garbage output always happening at all context sizes.
Pretty sure this is due to the Grouped-Query Attention (GQA) that's used with the 70B model as per the paper [0].
OK thanks for the update. Yeah it was a bit surprising that just removing the checks would get it working, but you never know until you try!
@ggerganov once you have time to look into this: the Llama 2 paper references this paper elaborating on the GQA idea: https://arxiv.org/pdf/2305.13245.pdf.
So who will be the hero to implement GQA and send a PR? 😬
Trying to figure out the GQA implementation details .. https://github.com/facebookresearch/llama/issues/384.
> So the chat model uses something like `{BOS}[INST] <<SYS>> {system} <</SYS>> {instruct-0} [/INST] {response-0} {EOS}{BOS}[INST] {instruct-1} [/INST] {response-1} {EOS}{BOS}[INST] {instruct-N} [/INST]`
> The model generates EOS automatically, but there's no way to insert BOS with the current code in this repo, neither in main nor in server.
You mean from the CLI, right?
Using the library, BOS is available as `llama_token_bos()`.
It might be possible to cook up something just for this. Special syntax to express BOS/EOS from the CLI might be a bit of a pain.
Meta just released the Llama 2 model, allowing commercial usage
https://ai.meta.com/resources/models-and-libraries/llama/
I have checked the model implementation and it seems different from llama_v1; it may need a re-implementation.