ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Add llama 2 model #2262

Closed: tikikun closed this issue 12 months ago

tikikun commented 1 year ago

Meta just released the Llama 2 model, allowing commercial usage

https://ai.meta.com/resources/models-and-libraries/llama/

I have checked the model implementation and it seems different from LLaMA v1; it may need a re-implementation

kharvd commented 1 year ago

Got this working with llama-cpp-python based on https://github.com/facebookresearch/llama/blob/4d92db8a1db6c7f663252bf3477d2c4b8bad2385/llama/generation.py#L212:

from typing import Dict, List

# "Message" is assumed here to be a plain role/content dict, e.g. {"role": "user", "content": "Hi"}
Message = Dict[str, str]

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""

def make_prompt_llama2(llm, messages: List[Message]) -> List[int]:
    if messages[0]["role"] != "system":
        messages = [
                {
                    "role": "system",
                    "content": DEFAULT_SYSTEM_PROMPT,
                }
        ] + messages
    messages = [
            {
                "role": messages[1]["role"],
                "content": B_SYS
                + messages[0]["content"]
                + E_SYS
                + messages[1]["content"],
            }
    ] + messages[2:]
    assert all([msg["role"] == "user" for msg in messages[::2]]) and all(
        [msg["role"] == "assistant" for msg in messages[1::2]]
    ), (
        "model only supports 'system', 'user' and 'assistant' roles, "
        "starting with 'system', then 'user' and alternating (u/a/u/a/u...)"
    )

    dialog_tokens = sum(
        [
            llm.tokenize(
                bytes(
                    f"{B_INST} {(prompt['content']).strip()} {E_INST} {(answer['content']).strip()} ",
                    "utf-8",
                ),
                add_bos=True,
            )
            + [llm.token_eos()]
            for prompt, answer in zip(
                messages[::2],
                messages[1::2],
            )
        ],
        [],
    )
    assert (
        messages[-1]["role"] == "user"
    ), f"Last message must be from user, got {messages[-1]['role']}"

    dialog_tokens += llm.tokenize(
        bytes(f"{B_INST} {(messages[-1]['content']).strip()} {E_INST}", "utf-8"),
        add_bos=True,
    )

    return dialog_tokens

and then

completion = llm.generate(make_prompt_llama2(llm, [
    {
        "role": "user",
        "content": "Hi",
    },
]), top_p=0.9, temp=0.6, top_k=65535)

for token in completion:
    if token == llm.token_eos():
        break
    print(llm.detokenize([token]).decode("utf-8"), end="")

tmm1 commented 1 year ago

Are [INST] and <<SYS>> supposed to be tokens?


   518 -> ' ['
 25580 -> 'INST'
 29962 -> ']'
  3532 -> ' <<'
 14816 -> 'SY'
 29903 -> 'S'
  6778 -> '>>'

kharvd commented 1 year ago

The llama repo tokenizer seems normal to me with respect to the [INST] and <<SYS>>: https://github.com/facebookresearch/llama/blob/main/llama/tokenizer.py
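
A quick way to confirm this is the hedged sketch below; it assumes Meta's tokenizer.model is in the current directory and the sentencepiece package is installed:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
# "[INST]" and "<<SYS>>" come out as ordinary subword pieces, not single special tokens
print(sp.encode("[INST] <<SYS>>", out_type=str))
# expect something like: ['▁[', 'INST', ']', '▁<<', 'SY', 'S', '>>'], matching the IDs dumped above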

tmm1 commented 1 year ago

Thanks @kharvd. I was able to get your sample working as well via llama-cpp-python.

I notice that llm.generate is not able to stream the output, but llm() is. However, when I tried replacing llm.generate with llm (and temp with temperature), it still had some problems.

ggerganov commented 1 year ago

I believe an initial version of GQA has been demonstrated by @jploski (see https://github.com/ggerganov/llama.cpp/issues/1602#issuecomment-1606332010 and related comments). It still needs some work to support the BLAS and GPU backends.

BEILOP commented 1 year ago

How many 80GB A100s would I need in order to fine-tune the 70B model?

philschmid commented 1 year ago

How many 80GB A100s would I need in order to fine-tune the 70B model?

1x is enough when using QLoRA and int-4
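
For reference, a QLoRA setup is roughly the sketch below. This is not llama.cpp code; it assumes the Hugging Face transformers + peft + bitsandbytes stack and access to the meta-llama/Llama-2-70b-hf weights (or a mirror), and the LoRA hyperparameters shown are only illustrative.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# load the base model quantized to 4-bit (NF4), so the frozen weights fit on one 80GB GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# attach small trainable LoRA adapters; only these are updated during fine-tuning
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)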

ggerganov commented 1 year ago

I believe an initial version of GQA has been demonstrated by @jploski (see https://github.com/ggerganov/llama.cpp/issues/1602#issuecomment-1606332010 and related comments). It still needs some work to support the BLAS and GPU backends.

Without looking into the details of GQA, and assuming my comment above is correct, I would guess that all it takes to support CPU inference for 70B would be something similar to the following change that was made for the Falcon convert script:

https://github.com/jploski/ggml/blob/2e30a2b0356c1f3d589e670523fbca0b342e1438/examples/falcon/convert-hf-to-ggml.py#L105-L124

Again, I could be wrong. When I get access to the model, I will take a deeper look.

philschmid commented 1 year ago

@ggerganov someone uploaded the weights to Hugging Face without the gate: https://huggingface.co/NousResearch/Llama-2-70b-hf. And it's not against the license, since it allows redistribution.

Green-Sky commented 1 year ago

We should really push for GGUF before adding GQA, to avoid an extra file format.

klosax commented 1 year ago

We should really push for GGUF before adding GQA, to avoid an extra file format.

Strong +1 for @Green-Sky's suggestion!

There is an urgent need to support model architectures that perform equal to or better than the original LLaMA. The spec of the new ggml file format is being worked out in https://github.com/ggerganov/ggml/pull/302 for those interested.

Cookie771 commented 1 year ago

Hi! I'm sorry, I'm new to GitHub. I tried to download Llama 2 but it's not working: the cmd program closes without downloading anything after I enter the model name (I downloaded and installed "wget" beforehand, and I don't know how to get "md5sum" on Windows). Can anybody help me, please?

yaroslavyaroslav commented 1 year ago

@ggerganov Just in case: I've got access to the model download page, and I'll be happy to share it with you via whatever channel you choose, if it's still necessary.

Green-Sky commented 1 year ago

The section on Redistribution (1.b.) reads like torrents would be legal under certain conditions.

SlyEcho commented 1 year ago

If you get the email from Facebook with the link, you can also apply for access to their Hugging Face Hub model pages, provided you use the same email for your HF registration.

klosax commented 1 year ago

Hi! I'm sorry, I'm new to GitHub. I tried to download Llama 2 but it's not working: the cmd program closes without downloading anything after I enter the model name (I downloaded and installed "wget" beforehand, and I don't know how to get "md5sum" on Windows). Can anybody help me, please?

@Cookie771 You can download quantized ggml files here:

https://huggingface.co/TheBloke/Llama-2-7B-GGML/tree/main https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main

ggerganov commented 1 year ago

I got official access and have already downloaded the models. Thanks

Cookie771 commented 1 year ago

Hi! I'm sorry, I'm new to GitHub. I tried to download Llama 2 but it's not working: the cmd program closes without downloading anything after I enter the model name (I downloaded and installed "wget" beforehand, and I don't know how to get "md5sum" on Windows). Can anybody help me, please?

@Cookie771 You can download quantized ggml files here:

https://huggingface.co/TheBloke/Llama-2-7B-GGML/tree/main https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main

Thanks, is it the chat version?

klosax commented 1 year ago

Thanks, is it the chat version?

Chat versions:

https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main

ghost commented 1 year ago

It works with 4096 tokens out of the box.

@SlyEcho Hi, I see the model is fine-tuned for a 4096-token context. Are you using RoPE parameters, or what?

alphadl commented 1 year ago

[An off-topic but important question] Has anyone noticed that max_position_embeddings is 2048 rather than 4096 in the downloaded HF Llama 2 checkpoints?

SlyEcho commented 1 year ago

I was using @TheBloke's quantized 7B model.

I just passed -c 4096 with no scaling, fed it a big file (>3000 tokens) with -f, and it generated coherent text.
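
For reference, the invocation would have been something like the line below (a sketch; the model filename and prompt file are illustrative):

./main -m llama-2-7b.ggmlv3.q4_0.bin -c 4096 -f longprompt.txt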

ggerganov commented 1 year ago

I think I have a 70B prototype here: https://github.com/ggerganov/llama.cpp/pull/2276

It needs some more work and I'm not 100% sure it is correct, but the generated text looks coherent.

wizzard0 commented 1 year ago

Note #2276 breaks non-GQA models:

error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  4096 x   512, got  4096 x  4096
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'llama-2-7b.ggmlv3.q2_K.bin'
main: error: unable to load model

TikkunCreation commented 1 year ago

So the chat model uses something like

{BOS}[INST] <<SYS>>
{system}
<</SYS>>

{instruct-0} [/INST] {response-0} {EOS}{BOS}[INST] {instruct-1} [/INST] {response-1} {EOS}{BOS}[INST] {instruct-N} [/INST]

The model generates EOS automatically, but there's no way to insert BOS with the current code in this repo, neither in main nor in server.

For clarity, it uses <s> for BOS and </s> for EOS (I checked with a Python script using tokenizer.model).
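
To make the turn structure concrete, here is a rough string-level sketch of that template (a hypothetical helper, not code from this repo; the literal <s>/</s> below only stand in for BOS/EOS, which really have to be injected as token IDs):

def render_llama2_chat(system, turns, next_user):
    """turns: list of completed (user, assistant) pairs; next_user: the new user message."""
    users = [u for u, _ in turns] + [next_user]
    answers = [a for _, a in turns]
    out = ""
    for i, user in enumerate(users):
        # the system prompt is folded into the first user turn, as in the template above
        content = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}" if i == 0 else user
        out += f"<s>[INST] {content} [/INST]"
        if i < len(answers):
            out += f" {answers[i]} </s>"
    return out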

jxy commented 1 year ago

I made a simple change to main to add BOS.

diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index bcbcf12..5906cde 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -605,6 +605,8 @@ int main(int argc, char ** argv) {
             // replace end of text token with newline token when in interactive mode
             if (id == llama_token_eos() && params.interactive && !params.instruct) {
                 id = llama_token_newline.front();
+                embd_inp.push_back(llama_token_bos());
+                is_interacting = true;
                 if (params.antiprompt.size() != 0) {
                     // tokenize and inject first reverse prompt
                     const auto first_antiprompt = ::llama_tokenize(ctx, params.antiprompt.front(), false);

and run it like

./main -m "$MODEL" -c 4096 -n -1 --in-prefix ' [INST] ' --in-suffix ' [/INST]' -i -p \
"[INST] <<SYS>>
$SYSTEM
<</SYS>>

$FIRST_MESSAGE [/INST]"

I don't know if we want to add an argument like --insert-bos-after-eos to main.

Regarding <s> and </s>, neither main nor server can encode those to BOS or EOS.

SlyEcho commented 1 year ago

I think inp_pfx and inp_sfx should also be changed?

XiongjieDai commented 1 year ago

Hi! I'm sorry, I'm new to GitHub. I tried to download Llama 2 but it's not working: the cmd program closes without downloading anything after I enter the model name (I downloaded and installed "wget" beforehand, and I don't know how to get "md5sum" on Windows). Can anybody help me, please?

If you have Git Bash installed, you can run the .sh file from the Git Bash command line with: bash path/to/script.sh

jxy commented 1 year ago

I think inp_pfx and inp_sfx should also be changed?

Those are hard coded for the instruct mode

  -ins, --instruct      run in instruction mode (use with Alpaca models)

ziwang-com commented 1 year ago

Global launch (2023-07-20): architecture diagrams for the llama2-map module library: https://github.com/ziwang-com/AGI-MAP

(llama2_generation call-graph image)

Green-Sky commented 1 year ago

@ziwang-com Those are just call graphs for the Python code. I'm sorry, but the Python code is already simple to read as is; we don't really need those images. (Also, IMHO, they feel harder to read than the Python code.)

sowa705 commented 1 year ago

I think inp_pfx and inp_sfx should also be changed?

Those are hard coded for the instruct mode

  -ins, --instruct      run in instruction mode (use with Alpaca models)

Would it be possible to move them into the model file? That would solve the issue of different models having different prompt formats

viniciusarruda commented 1 year ago

Is Meta's tokenizer identical to the llama.cpp tokenizer? I think it should be, but I'm having an issue while decoding/encoding. This is also related to the chat completion format already mentioned above by @kharvd @jxy @TikkunCreation. You can see the issue in detail, and also replicate it, here. I'm comparing Meta's original tokenizer with a model from @TheBloke.

jxy commented 1 year ago

for llama-2-chat, #2304

jxy commented 1 year ago

and server, #2306

ggerganov commented 1 year ago

70B support should be ready to merge in #2276

Btw, I did some tests with 7B v2, and the texts generated from short prompts using Q4_0 and Q5_0 definitely feel weird. I wrote more about it in the PR description. It would be nice if other people could confirm these observations.

kurnevsky commented 1 year ago

It doesn't work with the following input:

llama-cpp -c 4096 -gqa 8 -t 16 -m llama-2-70b.ggmlv3.q4_K_M.bin -p "### HUMAN:\na\n\n### RESPONSE:\nb\n\n### HUMAN:\nb\n\n### RESPONSE:"

The error is GGML_ASSERT: /build/source/ggml.c:10648: ne02 == ne12.

WiSaGaN commented 1 year ago

The error is GGML_ASSERT: /build/source/ggml.c:10648: ne02 == ne12.

It worked in the vanilla case for me, but I got a similar error when I ran the binary built with "make LLAMA_CLBLAST=1". "-gqa 8" was passed in both cases.

kurnevsky commented 1 year ago

I actually do build with LLAMA_CLBLAST, but I tested without GPU offloading; I didn't know it affects execution somehow :) And I got this error with the model from https://huggingface.co/TheBloke/Llama-2-70B-GGML

Nyceane commented 1 year ago

@kurnevsky I am having the same problem. Were you able to fix it?

cebtenzzre commented 1 year ago

I am having the same problem. Were you able to fix it?

See #3002. Known workarounds are to not use the OpenCL backend with LLaMA 2, or to not use k-quants (Q*_K).

kleenkanteen commented 12 months ago

@tikikun What do you mean by adding the Llama 2 model, when this repo is about the LLaMA model? Also, on the main page, why does it say "Supported models:" and then list a bunch of other LLMs, when this repo is just about LLaMA?

ggerganov commented 12 months ago

LLaMA v2 and many other models are currently supported by llama.cpp. See the status page for more info

kleenkanteen commented 12 months ago

What do you mean by "currently supported"? Isn't llama.cpp just about LLaMA 1?

ggerganov commented 12 months ago

No, llama.cpp can run inference for all model architectures listed in the status page. It started just with LLaMA v1, but since then there has been a lot of progress and it now supports a variety of models.