Closed tikikun closed 12 months ago
Got this working with llama-cpp-python based on https://github.com/facebookresearch/llama/blob/4d92db8a1db6c7f663252bf3477d2c4b8bad2385/llama/generation.py#L212:
```python
from typing import List, TypedDict

# Minimal Message type for illustration; llama-cpp-python just needs
# dicts with "role" and "content" keys.
class Message(TypedDict):
    role: str
    content: str

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""

def make_prompt_llama2(llm, messages: List[Message]) -> List[int]:
    # Prepend the default system prompt if the caller did not supply one
    if messages[0]["role"] != "system":
        messages = [
            {
                "role": "system",
                "content": DEFAULT_SYSTEM_PROMPT,
            }
        ] + messages
    # Fold the system prompt into the first user message
    messages = [
        {
            "role": messages[1]["role"],
            "content": B_SYS + messages[0]["content"] + E_SYS + messages[1]["content"],
        }
    ] + messages[2:]
    assert all(msg["role"] == "user" for msg in messages[::2]) and all(
        msg["role"] == "assistant" for msg in messages[1::2]
    ), (
        "model only supports 'system', 'user' and 'assistant' roles, "
        "starting with 'system', then 'user' and alternating (u/a/u/a/u...)"
    )
    # Tokenize each completed (user, assistant) pair as
    # "{BOS}[INST] user [/INST] assistant {EOS}"
    dialog_tokens = sum(
        [
            llm.tokenize(
                bytes(
                    f"{B_INST} {prompt['content'].strip()} {E_INST} {answer['content'].strip()} ",
                    "utf-8",
                ),
                add_bos=True,
            )
            + [llm.token_eos()]
            for prompt, answer in zip(
                messages[::2],
                messages[1::2],
            )
        ],
        [],
    )
    assert (
        messages[-1]["role"] == "user"
    ), f"Last message must be from user, got {messages[-1]['role']}"
    # The trailing user message gets a BOS but no EOS, so the model completes it
    dialog_tokens += llm.tokenize(
        bytes(f"{B_INST} {messages[-1]['content'].strip()} {E_INST}", "utf-8"),
        add_bos=True,
    )
    return dialog_tokens
```

and then

```python
completion = llm.generate(
    make_prompt_llama2(llm, [
        {
            "role": "user",
            "content": "Hi",
        },
    ]),
    top_p=0.9,
    temp=0.6,
    top_k=65535,
)
for token in completion:
    if token == llm.token_eos():
        break
    print(llm.detokenize([token]).decode("utf-8"), end="")
```
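As a side note, the even/odd slicing in `make_prompt_llama2` pairs each user turn with the assistant reply that follows it, while the trailing user message is tokenized separately. A standalone sketch of just that pairing logic (no model required):

```python
# Standalone illustration of the user/assistant pairing in make_prompt_llama2:
# after the system prompt is folded in, even indices are user turns and odd
# indices are assistant turns.
messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "How are you?"},
]

# zip() stops at the shorter slice, so only complete (user, assistant)
# pairs are consumed; the trailing user message is left over.
pairs = list(zip(messages[::2], messages[1::2]))

assert len(pairs) == 1
assert pairs[0][0]["content"] == "Hi" and pairs[0][1]["content"] == "Hello!"
assert messages[-1]["role"] == "user"  # tokenized on its own at the end
```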
Are `[INST]` and `<<SYS>>` supposed to be tokens?

```
  518 -> ' ['
25580 -> 'INST'
29962 -> ']'
 3532 -> ' <<'
14816 -> 'SY'
29903 -> 'S'
 6778 -> '>>'
```
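Joining the pieces from that dump back together (a quick standalone check, using the piece strings listed above) shows the markers come out as ordinary text spans rather than single special tokens:

```python
# The ids above map to plain text pieces, so "[INST]" and "<<SYS>>" are
# tokenized as ordinary text, not as single special tokens.
inst_pieces = {518: " [", 25580: "INST", 29962: "]"}
sys_pieces = {3532: " <<", 14816: "SY", 29903: "S", 6778: ">>"}

inst = "".join(inst_pieces[i] for i in (518, 25580, 29962))
sys_marker = "".join(sys_pieces[i] for i in (3532, 14816, 29903, 6778))

assert inst == " [INST]"
assert sys_marker == " <<SYS>>"
```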
The llama repo tokenizer seems normal to me with respect to `[INST]` and `<<SYS>>`: https://github.com/facebookresearch/llama/blob/main/llama/tokenizer.py
Thanks @kharvd. I was able to get your sample working as well via llama-cpp-python.

I notice that `llm.generate` is not able to stream the output, but `llm()` is. However, when I tried to change `s/llm.generate/llm/` and `s/temp/temperature/`, it still had some problems.
I believe an initial version of GQA has been demonstrated by @jploski (see https://github.com/ggerganov/llama.cpp/issues/1602#issuecomment-1606332010 and related comments). It still needs some work to support the BLAS and GPU backends.
How many 80g a100's should I have in order to fine-tune the 70b model
> How many 80g a100's should I have in order to fine-tune the 70b model
1x is enough when using QLoRA and int-4
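A rough back-of-the-envelope sketch of why a single 80 GB card can suffice with QLoRA and int-4 (the overhead figures below are my own illustrative guesses, not measurements; real usage depends on batch size, sequence length, and activation checkpointing):

```python
# Rough VRAM estimate for QLoRA fine-tuning of a 70B model.
# The overhead numbers are illustrative guesses, not measurements.
params = 70e9
bytes_per_param_int4 = 0.5                             # 4-bit base weights
base_weights_gb = params * bytes_per_param_int4 / 1e9  # 35.0 GB frozen base model

lora_and_optimizer_gb = 5.0   # small LoRA adapters + their optimizer state (guess)
activations_gb = 20.0         # activations at modest batch/seq settings (guess)

total_gb = base_weights_gb + lora_and_optimizer_gb + activations_gb
assert base_weights_gb == 35.0
assert total_gb < 80.0  # fits on one 80 GB A100 under these assumptions
```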
> I believe an initial version of GQA has been demonstrated by @jploski (see https://github.com/ggerganov/llama.cpp/issues/1602#issuecomment-1606332010 and related comments). It still needs some work to support BLAS and GPU backends.
Without looking into the details of GQA, and assuming my comment above is correct, I would guess that all it takes to support CPU inference for 70B would be something similar to the following change that was made for the Falcon convert script:

Again, I could be wrong. When I get access to the model I will take a deeper look.
@ggerganov someone uploaded the weights on Hugging Face without the gate: https://huggingface.co/NousResearch/Llama-2-70b-hf. And it's not against the license, since it allows redistribution.
We should really push for GGUF before adding GQA, to avoid an extra fileformat.
> We should really push for GGUF before adding GQA, to avoid an extra fileformat.
Strong +1 for @Green-Sky suggestion!
There is an urgent need to support model architectures that perform equal to or better than the original LLaMA. The spec of the new ggml file format is being worked out in https://github.com/ggerganov/ggml/pull/302 for those interested.
Hi! I'm sorry, I'm new on GitHub. I tried to download Llama 2 but it's not working: the cmd program closes without downloading anything after I enter the model name (I downloaded and installed "wget" beforehand, and I don't know how to get "md5sum" on Windows). Can anybody help me, please?
@ggerganov Just in case, I've got access to the models download page, and I'll be happy to share it with you if it's still necessary, via whatever channel you choose.
The section on Redistribution (1.b.) reads like torrents would be legal under certain conditions.
If you get the email from Facebook with the link, you can also apply to get access to their Hugging Face Hub model pages, if you're using the same email for the HF user registration.
> Hi ! I'm sorry i'm new on github. I tried to download Llama 2 but it's not working, the cmd's program close without downloading anything after I wrote the model (I've download and install "wget" before, and i don't know how to get "md5sum" on Windows). Can anybody help me please ?
@Cookie771 You can download quantized ggml files here:
https://huggingface.co/TheBloke/Llama-2-7B-GGML/tree/main https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main
I got official access and have already downloaded the models. Thanks
> @Cookie771 You can download quantized ggml files here:
> https://huggingface.co/TheBloke/Llama-2-7B-GGML/tree/main https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main
Thanks, is it the chat version?
> Thanks, is it the chat version?
Chat versions:
https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main
It works with 4096 tokens out of the box.
@SlyEcho Hi, I see the model is fine-tuned for a 4096 context. Are you using RoPE parameters, or what?
[A not-relevant but important question] Has anyone noticed that `max_position_embeddings` is 2048 rather than 4096 in the downloaded HF llama2 checkpoints?
I was using @TheBloke's quantized 7B model. I just passed the arg `-c 4096` with no scaling, plus a big file (>3000 tokens) with `-f`, and it was generating coherent text.
I think I have a 70B prototype here: https://github.com/ggerganov/llama.cpp/pull/2276
It needs some more work and I'm not 100% sure it is correct, but text generation looks coherent.
Note #2276 breaks non-GQA models:

```
error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 4096 x 512, got 4096 x 4096
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'llama-2-7b.ggmlv3.q2_K.bin'
main: error: unable to load model
```
So the chat model uses something like

```
{BOS}[INST] <<SYS>> {system} <</SYS>> {instruct-0} [/INST] {response-0} {EOS}{BOS}[INST] {instruct-1} [/INST] {response-1} {EOS}{BOS}[INST] {instruct-N} [/INST]
```

The model generates EOS automatically, but there's no way to insert BOS with the current code in this repo, neither in main nor in server.
For clarity, it uses `<s>` for BOS and `</s>` for EOS (I checked with a Python script using tokenizer.model).
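The format above can be sketched as a plain string-rendering function (a hypothetical helper, not code from this repo; `<s>`/`</s>` stand in for BOS/EOS as literal text):

```python
# Hypothetical sketch of the llama-2-chat template described above,
# rendered as a plain string with <s>/</s> standing in for BOS/EOS.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def render_chat(system, turns):
    """turns: list of (instruct, response) pairs; the last response may be None."""
    out = []
    for i, (instruct, response) in enumerate(turns):
        # The system prompt is folded into the first user turn.
        text = (B_SYS + system + E_SYS + instruct) if i == 0 else instruct
        if response is None:
            # Trailing user turn: BOS but no EOS, so the model completes it.
            out.append(f"<s>{B_INST} {text.strip()} {E_INST}")
        else:
            out.append(f"<s>{B_INST} {text.strip()} {E_INST} {response.strip()} </s>")
    return "".join(out)

prompt = render_chat("Be brief.", [("Hi", "Hello!"), ("Bye", None)])
assert prompt.startswith("<s>[INST] <<SYS>>")
assert prompt.endswith("Bye [/INST]")
```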
I made a simple change to main to add BOS.

```diff
diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index bcbcf12..5906cde 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -605,6 +605,8 @@ int main(int argc, char ** argv) {
             // replace end of text token with newline token when in interactive mode
             if (id == llama_token_eos() && params.interactive && !params.instruct) {
                 id = llama_token_newline.front();
+                embd_inp.push_back(llama_token_bos());
+                is_interacting = true;
                 if (params.antiprompt.size() != 0) {
                     // tokenize and inject first reverse prompt
                     const auto first_antiprompt = ::llama_tokenize(ctx, params.antiprompt.front(), false);
```
and run it like

```sh
./main -m "$MODEL" -c 4096 -n -1 --in-prefix ' [INST] ' --in-suffix ' [/INST]' -i -p \
"[INST] <<SYS>>
$SYSTEM
<</SYS>>

$FIRST_MESSAGE [/INST]"
```

I don't know if we want an argument like `--insert-bos-after-eos` for main.
Regarding `<s>` and `</s>`, main or server cannot encode those to BOS or EOS. I think `inp_pfx` and `inp_sfx` should also be changed?
> Hi ! I'm sorry i'm new on github. I tried to download Llama 2 but it's not working, the cmd's program close without downloading anything after I wrote the model (I've download and install "wget" before, and i don't know how to get "md5sum" on Windows). Can anybody help me please ?

If you have Git Bash installed, you can run the .sh file from the Git Bash command line with: `bash path/to/script.sh`
> I think `inp_pfx` and `inp_sfx` should also be changed?

Those are hard coded for the instruct mode (`-ins, --instruct`: run in instruction mode, use with Alpaca models).
[23-7-20] Global first release: llama2-map module library architecture diagrams. https://github.com/ziwang-com/AGI-MAP
@ziwang-com those are just call graphs for the Python code. I'm sorry, but the Python code is already simple to read as is; we don't really need those images. (Also, imho, they feel harder to read than the Python code.)
> > I think `inp_pfx` and `inp_sfx` should also be changed?
>
> Those are hard coded for the instruct mode (`-ins, --instruct`: run in instruction mode, use with Alpaca models).
Would it be possible to move them into the model file? That would solve the issue of different models having different prompt formats
Is the Meta tokenizer identical to the llama_cpp tokenizer? I think it should be, but I'm having an issue while decoding/encoding. This is also related to the chat completion format already mentioned above by @kharvd @jxy @TikkunCreation. You can see the issue in detail and also replicate it here. I'm comparing the Meta original tokenizer with a model from @TheBloke.
For llama-2-chat: #2304, and for server: #2306.

70B support should be ready to merge in #2276
Btw, I did some tests with 7Bv2, and the generated texts from short prompts using `Q4_0` and `Q5_0` definitely feel weird. I wrote more about it in the PR description. It would be nice if other people could confirm the observations.
It doesn't work with the following input:

```sh
llama-cpp -c 4096 -gqa 8 -t 16 -m llama-2-70b.ggmlv3.q4_K_M.bin -p "### HUMAN:\na\n\n### RESPONSE:\nb\n\n### HUMAN:\nb\n\n### RESPONSE:"
```

The error is `GGML_ASSERT: /build/source/ggml.c:10648: ne02 == ne12`.
> The error is `GGML_ASSERT: /build/source/ggml.c:10648: ne02 == ne12`.

It worked in the vanilla case for me, but I got a similar error when I ran the binary built with `make LLAMA_CLBLAST=1`. `-gqa 8` was added in both cases.
I actually do use `LLAMA_CLBLAST`, but tested without GPU offloading; I didn't know it affects the execution somehow :)

And I got this error on the model from https://huggingface.co/TheBloke/Llama-2-70B-GGML
@kurnevsky I am having the same problem; were you able to fix it?
> I am having the same problem; were you able to fix it?
See #3002. Known workarounds are to not use the OpenCL backend with LLaMA 2, or to not use k-quants (Q*_K).
@tikikun What do you mean by adding the Llama 2 model when this repo is about the LLaMA model? Also, on the main page, why does it say "Supported models:" and then list a bunch of other LLMs when this repo is just about LLaMA?
LLaMA v2 and many other models are currently supported by `llama.cpp`. See the status page for more info.
What do you mean that it's currently supported? Isn't llama.cpp just about LLaMA 1?
On Wed., Oct. 18, 2023, Georgi Gerganov wrote:

> Closed #2262 https://github.com/ggerganov/llama.cpp/issues/2262 as completed.
No, `llama.cpp` can run inference for all model architectures listed in the status page. It started just with LLaMA v1, but since then there has been a lot of progress and it now supports a variety of models.
Meta just released the Llama 2 model, allowing commercial usage:

https://ai.meta.com/resources/models-and-libraries/llama/

I have checked the model implementation and it seems different from llama_v1; it may need a re-implementation.