ssemeniuta opened this issue 1 year ago
I've noticed this as well - I think it is the prompt template. What prompt format are you using? If it's following the example in the High Level API overview:
>>> from llama_cpp import Llama
>>> llm = Llama(model_path="./models/7B/ggml-model.bin")
>>> output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
>>> print(output)
The "Q:
I've also seen this used by libraries that wrap llama-cpp-python:
<s>[INST] <<SYS>>
*SYSTEM PROMPT HERE*
<</SYS>>
*USER PROMPT HERE* [/INST]
In my experience this prompt format tends to produce more censored results, even if you specify the system prompt as something along the lines of "be direct, unfiltered, and as accurate as possible".
Try removing the template entirely and just asking your question directly, e.g:
llm("what is your favorite color?")
That got me results that were identical to llama.cpp.
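A minimal sketch of the two ways of calling it that I mean (the model path and system prompt here are just placeholders, not anything specific to this thread):
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-13b-chat.ggmlv3.q4_0.bin", verbose=False)

# 1) Raw prompt, passed through as-is (closest to what ./main -p "..." does).
raw = llm("what is your favorite color?", max_tokens=64, temperature=0)

# 2) Llama 2 chat template, the way some wrapper libraries build it.
templated_prompt = (
    "<s>[INST] <<SYS>>\n"
    "Be direct, unfiltered, and as accurate as possible.\n"
    "<</SYS>>\n\n"
    "what is your favorite color? [/INST]"
)
templated = llm(templated_prompt, max_tokens=64, temperature=0)

print(raw["choices"][0]["text"])
print(templated["choices"][0]["text"])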
@Solaxun Interesting. I get wildly different responses compared to llama.cpp with this wrapper using the 13B version, and the responses are ramblings most of the time. I've tried different prompts (including no prompt at all), but it seldom helps.
Try setting the temp to zero. Here is an example I just tried, terminal is llama.cpp, vscode is llama-cpp-python:
Purely anecdotal from my own experiments - but whenever I use any of the common "prompt templates" I get different results. I don't think llama.cpp is applying any templates, it's just passing the prompt through as-is. So you need to be consistent in what you feed both, and setting the temp to zero makes it deterministic.
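As a quick way to see the determinism (a sketch, the path is a placeholder): with temperature=0 two runs of the same prompt should come back identical.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-13b-chat.ggmlv3.q4_0.bin", verbose=False)

# temperature=0 means greedy decoding, so repeated calls should return the exact same text.
a = llm("what is galbitang?", max_tokens=64, temperature=0)["choices"][0]["text"]
b = llm("what is galbitang?", max_tokens=64, temperature=0)["choices"][0]["text"]
print(a == b)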
Yep, I've set the temp to 0. Have you tested llama-2-13b.ggmlv3.q4_0 as well?
It's not just different responses actually. It's gibberish responses when using the wrapper.
That's what I'm using - llama-2-13b-chat.ggmlv3.q4_0.bin
Can you paste an example of both calls?
Here is mine: llama-cpp-python:
llm = Llama(
    model_path="/Users/solaxun/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin",
    n_gpu_layers=1,
    verbose=False,
)
output = llm('what is galbitang?', max_tokens=4000, echo=True, temperature=0)
print(output['choices'][0]['text'])
llama.cpp:
./main -i \
--threads 8 \
--n-gpu-layers 1 \
--model llama-2-13b-chat.ggmlv3.q4_0.bin \
--color \
--ctx-size 2048 \
--temp 0.0 \
--repeat_penalty 1.1 \
--n-predict -1 \
-p "what is galbitang?"
I forgot to align the context size calls but it didn't matter in this case.
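For anyone lining the two up exactly, something like this should mirror the flags above on the Python side (a sketch; I haven't double-checked every remaining llama.cpp default):
from llama_cpp import Llama

llm = Llama(
    model_path="/Users/solaxun/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin",
    n_ctx=2048,        # --ctx-size 2048
    n_threads=8,       # --threads 8
    n_gpu_layers=1,    # --n-gpu-layers 1
    verbose=False,
)
output = llm(
    "what is galbitang?",
    max_tokens=4000,       # roughly --n-predict
    temperature=0.0,       # --temp 0.0
    repeat_penalty=1.1,    # --repeat_penalty 1.1
    echo=True,
)
print(output["choices"][0]["text"])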
Here's a run of the same code with the wrapper (llama.cpp is working as expected either way). The only difference is the number of GPU layers:
>>> llm = Llama(model_path="models/llama-2-13b.ggmlv3.q4_0.bin", n_gpu_layers=43, verbose=False)
>>> output = llm('what is galbitang?', max_tokens=4000, echo=True, temperature=0)
>>> print(output['choices'][0]['text'])
what is galbitang?
Galbitang (갈비탕) is a Korean beef stew made with brisket, vegetables and rice cakes. It’s a hearty soup that’s perfect for cold winter days.
what is galbi tang?
Galbi Tang is a traditional Korean dish made from beef ribs. The meat is boiled in water until it becomes tender, then the broth is strained and served with rice or noodles. It can also be eaten as an appetizer or main course.
what is galbitang made of?
Galbitang is a Korean dish that consists of beef ribs, vegetables, and rice cakes. The ingredients are boiled together in water to create a hearty soup.
what is galbi tang made of?
Galbi Tang is a traditional Korean dish made from beef ribs. It is usually served with rice or noodles, but can also be eaten as an appetizer or main course. The ingredients are boiled together in water to create a hearty soup.
what is galbitang good for?
Galbitang is a Korean dish that consists of beef ribs, vegetables, and rice cakes. It’s typically served as an appetizer or main course with rice or noodles. The ingredients are boiled together in water to create a hearty soup.
what is galbi tang good for?
Galbitang is a Korean dish that consists of beef ribs, vegetables, and rice cakes. It’s typically served as an appetizer or main course with rice or noodles. The ingredients are boiled together in water to create a hearty soup. Galbi Tang can also be eaten as an appetizer or main course.
Seems like the number of GPU layers makes a difference. Using 1 layer:
what is galbitang?
Galbi Tang (갈비탕) is a Korean beef soup made with ribs. It’s a hearty, rich broth that’s perfect for cold winter days. The dish is traditionally served in a large bowl and eaten with rice or noodles.
Galbi Tang is one of the most popular soups in Korea. It’s made from beef ribs, which are boiled until they become tender. Then, they’re added to a broth that contains vegetables such as carrots, onions, and potatoes. The soup is usually served with rice or noodles.
Galbi Tang is often served at special occasions like weddings and funerals. It’s also used as a comfort food during times of sadness or stress.
how to make galbitang?
Galbi Tang (갈비탕) is a Korean beef soup made with ribs. It’s a hearty, rich broth that’s perfect for cold winter days. The dish is traditionally served in a large bowl and eaten with rice or noodles. Galbi Tang is one of the most popular soups in Korea. It’s made from beef ribs, which are boiled until they become tender. Then, they’re added to a broth that contains vegetables such as carrots, onions, and potatoes. The soup is usually served with rice or noodles. Galbi Tang is often served at special occasions like weddings and funerals. It’s also used as a comfort food during times of sadness or stress.
Galbitang is a Korean beef soup made with ribs. It’s a hearty, rich broth that’s perfect for cold winter days. The dish is traditionally served in a large bowl and eaten with rice or noodles. Galbi Tang is one of the most popular soups in Korea. It’s made from beef ribs, which are boiled until they become tender. Then, they’re added to a broth that contains vegetables such as carrots, onions, and potatoes. The soup is usually served with rice or noodles. Galbi Tang is often served at special occasions like weddings and funerals.
cc @abetlen
When you say "working as expected" what does that mean? Same result as my screenshot? I'm not sure if there should be any guarantee of consistency across different machines.
If you run with one layer (your bottom screenshot) is it different from the cpp output also using one layer? Make sure the arguments to both cpp and the wrapper are consistent. Are you 100% positive you are using the exact same model (same filepath) or do you have two different copies downloaded?
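One way to rule out the two-copies possibility is to hash both files and compare (a sketch; the paths are just the ones mentioned in this thread):
import hashlib

def sha256(path, chunk_size=1 << 20):
    # Stream the file in chunks so large model files don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256("/Users/solaxun/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin"))
print(sha256("models/llama-2-13b.ggmlv3.q4_0.bin"))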
Working as expected means, in general, that the answer is at a GPT-3.5-ish level, which is what Llama 2 (even the 13B) is supposed to give.
In contrast, you can see that when using this wrapper the model rambles quite a bit (cf. above answers), and sometimes it doesn't even answer the questions.
Just one example below (with temp = 0):
Prompt: What is 10+10 - 100?
llama.cpp: "The calculation is as follows:\n\n10 + 10 = 20\n\n20 - 100 = -80" (with temp > 0 the answers are still correct, just phrased differently.)
llama-cpp-python: "20" (with temp > 0 the responses are more gibberish; sometimes they are totally off, e.g. Python code is returned.)
Yep, it is the same model from TheBloke on Hugging Face. I'm even giving it the same values for the parameters that exist in both repos (in fact, as far as I can tell, the default values of the web server this repo provides match the llama.cpp defaults).
For example, to get the above responses, I just ran your code snippet.
And yes, llama.cpp will give a coherent and consistent response whether it runs fully on cpu, partly on GPU, or fully on GPU.
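Since the layer count seemed to matter earlier in the thread, here is a quick sketch to isolate that one variable on the wrapper side (greedy settings, short completions):
from llama_cpp import Llama

prompt = "what is galbitang?"
for layers in (0, 1, 43):
    llm = Llama(
        model_path="models/llama-2-13b.ggmlv3.q4_0.bin",
        n_gpu_layers=layers,
        verbose=False,
    )
    text = llm(prompt, max_tokens=64, temperature=0)["choices"][0]["text"]
    print(f"n_gpu_layers={layers}: {text!r}")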
Hmm, not sure what to tell you. I tried your example and at first I was getting slightly different results between llama.cpp and llama-cpp-python, but I realized I had capitalized "What" in one example and not the other. Every detail matters with prompts. After making sure they were identical, this is what I get in both:
Curious to see if you get to the bottom of this, but honestly I'm not sure what's causing the diff for you.
But even your response is a bit off.
"I'm having trouble with this problem. Can someone please help me understand how to solve it? Thank you!"
The model shouldn't start the answer like that. Unless you are using a specific prompt telling it to behave in this way?
One more question for you: is your model deployed with llama-cpp-python capable of handling complex prompts? Because for me, even simpler QA prompts only work sort of okay-ish.
Could you please test the following prompt?
"\nGiven the following email delimited by triple backticks extract the first and last name of the recipient of the email as a json.\nDo not include titles or honorifics in the names. Do not assume a first name and last name if they don't belong to the recipient but return null values instead.\n\nOnly return the json data as a response.\n\nEmail content: ```Hello Alexander Schon,\n\nIt's been a while since we caught up! I was going through some old photos and stumbled upon that hilarious snap from our college camping trip. Ah, the memories!\n\nHow have you been? I heard you've moved to New York – that's fantastic. We should definitely meet up if I'm ever in the city or if you're visiting here.\n\nDrop me a message and let's plan something. Looking forward to hearing all your stories.\n\nTake care,\n\nSam```\n"
llama.cpp returns the expected JSON response. Not at all the case for the wrapper.
I'm only speaking to the consistency between the cpp lib and this one, which for my samples has been identical (so far). As far as getting the output to be exactly how you want? That's a problem for every LLM out there, and is a function of your prompt, the model, and the parameters.
I had previously included the argument echo=True, which returns your prompt in addition to the response. If you remove it (the default is False) you will get just the response. Here is what your example returned for me, using your exact prompt and no other text. You could argue that you don't want it to say "expected output", and maybe with enough prompt experimentation you could accomplish that, but still, pretty good as a first attempt.
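In other words (a sketch; email_prompt stands in for the full email prompt quoted above):
from llama_cpp import Llama

llm = Llama(model_path="llama-2-13b-chat.ggmlv3.q4_0.bin", verbose=False)

# echo=True prepends the prompt to the returned text; with echo=False (the default)
# you get only the model's completion.
email_prompt = "Given the following email delimited by triple backticks ..."  # full prompt as quoted above
output = llm(email_prompt, max_tokens=512, temperature=0, echo=False)
print(output["choices"][0]["text"])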
We should probably continue this discussion in a chat of some sort to avoid spamming the maintainers. Feel free to ping me on Discord, same username (Solaxun).
Yeah, prompt engineering is another topic. I'm just saying the responses that I get amount to ramblings/garbage a lot of the time, regardless of prompt used/hyperparameters tuned (using the defaults, or maybe more conservative ones).
No, this response is quite good. I get gibberish output as a response for this specific query with llama-cpp-python. And have tested on two machines! Very odd.
But many thanks for your help.
Sorry all, I got the same issue with the OpenAI server Docker build.

Input:

{
  "messages": [
    { "content": "You are a helpful assistant.", "role": "system" },
    { "content": "Q: Do you know Apple company? A: ", "role": "user" }
  ]
}

Output:

{
  "id": "chatcmpl-0bbe41ad-a770-4fdf-a73d-d874a0a6dcca",
  "object": "chat.completion",
  "created": 1692689471,
  "model": "/models/llama-2-13b-chat.ggmlv3.q8_0.bin",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Yes sir/ girlfriendshipshipshipshipitalized italian ItalyFLAGging"
      },
      "finish_reason": "length"
    }
  ],
  "usage": { "prompt_tokens": 29, "completion_tokens": 16, "total_tokens": 45 }
}
Run again and output:

{
  "id": "chatcmpl-fe4811fe-7ce3-4d7f-9d7b-06140a97a5b3",
  "object": "chat.completion",
  "created": 1692690218,
  "model": "/models/llama-2-13b-chat.ggmlv3.q8_0.bin",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " yes yes!ccoordinates geometryändersonally recommendations recommendations recommendation recommendations"
      },
      "finish_reason": "length"
    }
  ],
  "usage": { "prompt_tokens": 29, "completion_tokens": 16, "total_tokens": 45 }
}
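For anyone reproducing this against the server, a minimal request sketch (assumes the default python -m llama_cpp.server host/port of localhost:8000):
import requests

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Q: Do you know Apple company? A: "},
    ],
    "max_tokens": 16,
    "temperature": 0,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])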
Expected Behavior
I train a model with the transformers lib, then convert it to llama.cpp format using convert.py from llama.cpp. Then, as a sanity check, I compare generation by transformers, llama.cpp, and llama-cpp-python. I use f32 models in llama.cpp and llama-cpp-python. I configure all three to decode greedily, picking the top-1 token at every step by setting top-k=1 and repeat_penalty=1.0.

I found that transformers and llama-cpp-python produce 100% identical results, while the output of the llama.cpp binaries differs. Perhaps there are generation parameters whose default values differ between llama.cpp and llama-cpp-python? If not, what could cause this discrepancy?
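A rough sketch of that comparison for the two Python sides (model paths are placeholders; the llama.cpp binary would be run separately with matching flags):
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_cpp import Llama

prompt = "Some test prompt"

# transformers: greedy decoding, no repetition penalty.
tok = AutoTokenizer.from_pretrained("./my-finetuned-model")
model = AutoModelForCausalLM.from_pretrained("./my-finetuned-model")
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, do_sample=False, max_new_tokens=64, repetition_penalty=1.0)
print(tok.decode(out[0], skip_special_tokens=True))

# llama-cpp-python: temperature=0 / top_k=1 with repeat_penalty=1.0 should match greedy decoding.
llm = Llama(model_path="./my-finetuned-model-f32.bin", verbose=False)
res = llm(prompt, max_tokens=64, temperature=0.0, top_k=1, repeat_penalty=1.0)
print(res["choices"][0]["text"])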
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.