abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Different generation in llama.cpp and llama-cpp-python #619

Open ssemeniuta opened 1 year ago

ssemeniuta commented 1 year ago

Expected Behavior

I train a model with the transformers lib, then convert it to llama.cpp format using convert.py from llama.cpp. Then, as a sanity check, I compare generation by transformers, llama.cpp, and llama-cpp-python. I use f32 models in llama.cpp and llama-cpp-python, and I configure all three to decode greedily, picking the top-1 token at every step, by setting top-k=1 and repeat_penalty=1.0.

I found that transformers and llama-cpp-python produce 100% identical results, while the output of the llama.cpp binaries differs. Perhaps there are generation parameters whose default values differ between llama.cpp and llama-cpp-python? If not, what could cause this discrepancy?
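For reference, a minimal sketch of the greedy setup on the llama-cpp-python side (the model path and prompt are placeholders; the sampling settings are the ones described above):

from llama_cpp import Llama

# Hypothetical f32 model path; top_k=1, temperature=0.0 and repeat_penalty=1.0
# correspond to the greedy, penalty-free decoding described above.
llm = Llama(model_path="./models/ggml-model-f32.bin", verbose=False)

output = llm(
    "Once upon a time",   # placeholder prompt
    max_tokens=64,
    temperature=0.0,
    top_k=1,
    repeat_penalty=1.0,
)
print(output["choices"][0]["text"])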

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    1
Core(s) per socket:    64
Socket(s):             2
NUMA node(s):          2
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7662 64-Core Processor
Stepping:              0
CPU MHz:               2100.865
CPU max MHz:           2000,0000
CPU min MHz:           1500,0000
BogoMIPS:              3999.91
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-63
NUMA node1 CPU(s):     64-127
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
$ nvidia-smi -L
GPU 0: A100-PCIE-40GB (UUID: GPU-1d02b89e-9be9-ece5-472c-8ec1790ffdbc)
$ uname -a
Linux hostname 5.4.164-1.el7.elrepo.x86_64 #1 SMP Mon Dec 6 12:28:33 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.2.1
Built for x86_64-pc-linux-gnu
$ g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.

Solaxun commented 1 year ago

I've noticed this as well - I think it is the prompt template. What prompt format are you using? If it's following the example in the High Level API overview:

>>> from llama_cpp import Llama
>>> llm = Llama(model_path="./models/7B/ggml-model.bin")
>>> output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
>>> print(output) 

The "Q: A:" is going to impact the results.

I've also seen this used by libraries that wrap llama-cpp-python:

<s>[INST] <<SYS>>
*SYSTEM PROMPT HERE*
<</SYS>>

*USER PROMPT HERE* [/INST]

In my experience this prompt format tends to produce more censored results, even if you specify the system prompt as something along the lines of "be direct, unfiltered, and as accurate as possible".

Try removing the template entirely and just asking your question directly, e.g. llm("what is your favorite color?")

That got me results that were identical to llama.cpp.
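As a rough sketch of that comparison (model path is a placeholder; the templated variant uses the Llama-2 chat format quoted above):

from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-13b-chat.ggmlv3.q4_0.bin", verbose=False)

# Raw prompt, passed through exactly as llama.cpp's ./main -p would send it.
raw = llm("what is your favorite color?", max_tokens=64, temperature=0)

# Same question wrapped in the [INST]/<<SYS>> chat template shown above.
templated_prompt = (
    "<s>[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
    "what is your favorite color? [/INST]"
)
templated = llm(templated_prompt, max_tokens=64, temperature=0)

print(raw["choices"][0]["text"])
print(templated["choices"][0]["text"])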

rlleshi commented 1 year ago

@Solaxun Interesting. With this wrapper I get widely different responses compared to llama.cpp, using the 13B version, and the responses are ramblings most of the time. I've tried different prompts (including no prompt at all), but it seldom helps.

Solaxun commented 1 year ago

Try setting the temp to zero. Here is an example I just tried; the terminal is llama.cpp, VS Code is llama-cpp-python:

[screenshot: side-by-side output from llama.cpp (terminal) and llama-cpp-python (VS Code)]

Purely anecdotal from my own experiments, but whenever I use any of the common "prompt templates" I get different results. I don't think llama.cpp applies any template; it just passes the prompt through as-is. So you need to be consistent in what you feed both, and setting the temp to zero makes the output deterministic.

rlleshi commented 1 year ago

Yep, I've set the temp to 0. Have you tested llama-2-13b.ggmlv3.q4_0 as well?

It's not just different responses actually. It's gibberish responses when using the wrapper.

Solaxun commented 1 year ago

That's what I'm using - llama-2-13b-chat.ggmlv3.q4_0.bin

Can you paste an example of both calls?

Here is mine: llama-cpp-python:

from llama_cpp import Llama

llm = Llama(
    model_path="/Users/solaxun/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin",
    n_gpu_layers=1,
    verbose=False,
)
output = llm('what is galbitang?', max_tokens=4000, echo=True, temperature=0)
print(output['choices'][0]['text'])

llama.cpp:

./main -i \
--threads 8 \
--n-gpu-layers 1 \
--model llama-2-13b-chat.ggmlv3.q4_0.bin \
--color \
--ctx-size 2048 \
--temp 0.0 \
--repeat_penalty 1.1 \
--n-predict -1 \
-p "what is galbitang?"     

I forgot to align the context-size arguments, but it didn't matter in this case.
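(Aligning the context size would presumably just mean passing n_ctx to the constructor; a sketch, not something I re-ran here:)

from llama_cpp import Llama

llm = Llama(
    model_path="/Users/solaxun/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin",
    n_ctx=2048,        # matches --ctx-size 2048 in the llama.cpp call above
    n_gpu_layers=1,    # matches --n-gpu-layers 1
    verbose=False,
)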

rlleshi commented 1 year ago

Here's a run of the same code with the wrapper (llama.cpp works as expected either way). The only difference is the number of GPU layers:

>>> from llama_cpp import Llama
>>> llm = Llama(model_path="models/llama-2-13b.ggmlv3.q4_0.bin", n_gpu_layers=43, verbose=False)
>>> output = llm('what is galbitang?', max_tokens=4000, echo=True, temperature=0)
>>> print(output['choices'][0]['text'])
what is galbitang?
Galbitang (갈비탕) is a Korean beef stew made with brisket, vegetables and rice cakes. It’s a hearty soup that’s perfect for cold winter days.
what is galbi tang?
Galbi Tang is a traditional Korean dish made from beef ribs. The meat is boiled in water until it becomes tender, then the broth is strained and served with rice or noodles. It can also be eaten as an appetizer or main course.
what is galbitang made of?
Galbitang is a Korean dish that consists of beef ribs, vegetables, and rice cakes. The ingredients are boiled together in water to create a hearty soup.
what is galbi tang made of?
Galbi Tang is a traditional Korean dish made from beef ribs. It is usually served with rice or noodles, but can also be eaten as an appetizer or main course. The ingredients are boiled together in water to create a hearty soup.
what is galbitang good for?
Galbitang is a Korean dish that consists of beef ribs, vegetables, and rice cakes. It’s typically served as an appetizer or main course with rice or noodles. The ingredients are boiled together in water to create a hearty soup.
what is galbi tang good for?
Galbitang is a Korean dish that consists of beef ribs, vegetables, and rice cakes. It’s typically served as an appetizer or main course with rice or noodles. The ingredients are boiled together in water to create a hearty soup. Galbi Tang can also be eaten as an appetizer or main course.

Seems like the number of GPU layers makes a difference. Using 1 layer:

what is galbitang?
Galbi Tang (갈비탕) is a Korean beef soup made with ribs. It’s a hearty, rich broth that’s perfect for cold winter days. The dish is traditionally served in a large bowl and eaten with rice or noodles.
Galbi Tang is one of the most popular soups in Korea. It’s made from beef ribs, which are boiled until they become tender. Then, they’re added to a broth that contains vegetables such as carrots, onions, and potatoes. The soup is usually served with rice or noodles.
Galbi Tang is often served at special occasions like weddings and funerals. It’s also used as a comfort food during times of sadness or stress.
how to make galbitang?
Galbi Tang (갈비탕) is a Korean beef soup made with ribs. It’s a hearty, rich broth that’s perfect for cold winter days. The dish is traditionally served in a large bowl and eaten with rice or noodles. Galbi Tang is one of the most popular soups in Korea. It’s made from beef ribs, which are boiled until they become tender. Then, they’re added to a broth that contains vegetables such as carrots, onions, and potatoes. The soup is usually served with rice or noodles. Galbi Tang is often served at special occasions like weddings and funerals. It’s also used as a comfort food during times of sadness or stress.
Galbitang is a Korean beef soup made with ribs. It’s a hearty, rich broth that’s perfect for cold winter days. The dish is traditionally served in a large bowl and eaten with rice or noodles. Galbi Tang is one of the most popular soups in Korea. It’s made from beef ribs, which are boiled until they become tender. Then, they’re added to a broth that contains vegetables such as carrots, onions, and potatoes. The soup is usually served with rice or noodles. Galbi Tang is often served at special occasions like weddings and funerals.

cc @abetlen

Solaxun commented 1 year ago

When you say "working as expected" what does that mean? Same result as my screenshot? I'm not sure if there should be any guarantee of consistency across different machines.

If you run with one layer (your second example), is it different from the llama.cpp output also using one layer? Make sure the arguments to both llama.cpp and the wrapper are consistent. Are you 100% positive you are using the exact same model (same filepath), or do you have two different copies downloaded?

rlleshi commented 1 year ago

"Working as expected" means, in general, that the answer is at a GPT-3.5-ish level, which is what Llama 2 (even 13B) is supposed to give.

In contrast, you can see that when using this wrapper the model rambles quite a bit (cf. above answers), and sometimes it doesn't even answer the questions.

Just one example below (with temp = 0):

Prompt: What is 10+10 - 100?

llama.cpp: "The calculation is as follows:\n\n10 + 10 = 20\n\n20 - 100 = -80" (with temp > 0 it still gives correct answers, just phrased in different ways.)

llama-cpp-python: "20" (with temp > 0, more gibberish responses; sometimes responses that are totally off, e.g. Python code, are given.)

Yep, it's the same model from TheBloke on Hugging Face. I'm even explicitly giving it the same values as llama.cpp for the parameters that exist in both repos (and as far as I can tell, the defaults of the web server this repo provides already match llama.cpp's defaults).

For example, to get the above responses, I just ran your code snippet.

And yes, llama.cpp gives a coherent and consistent response whether it runs fully on CPU, partly on GPU, or fully on GPU.
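One way to rule out default-parameter drift entirely (a sketch, not taken from the thread) would be to pass every sampling parameter explicitly on the wrapper side, mirroring the llama.cpp flags instead of relying on either set of defaults:

from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-13b.ggmlv3.q4_0.bin",
    n_ctx=2048,
    n_gpu_layers=43,
    verbose=False,
)
output = llm(
    "what is galbitang?",
    max_tokens=4000,
    temperature=0.0,
    top_k=40,            # assumed llama.cpp default at the time
    top_p=0.95,          # assumed llama.cpp default at the time
    repeat_penalty=1.1,  # matches --repeat_penalty 1.1
    echo=True,
)
print(output["choices"][0]["text"])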

Solaxun commented 1 year ago

Hmm, not sure what to tell you. I tried your example and at first I was getting slightly different results between llama.cpp and llama-cpp-python, but I realized I had capitalized "What" in one example and not the other. Every detail matters with prompts. After making sure they were identical, this is what I get in both:

[screenshot: matching output from llama.cpp and llama-cpp-python for the 10+10-100 prompt]

Curious to see if you get to the bottom of this, but honestly I'm not sure what's causing the diff for you.

rlleshi commented 1 year ago

But even your response is a bit off.

I'm having trouble with this problem. Can someone please help me understand how to solve it?
Thank you!

The model shouldn't start the answer like that. Unless you are using a specific prompt telling it to behave in this way?

One more question for you: is your model, deployed with llama-cpp-python, capable of handling complex prompts? Because mine works only sort of okay-ish even with simpler QA prompts.

Could you please test the following prompt?

"\nGiven the following email delimited by triple backticks extract the first and last name of the recipient of the email as a json.\nDo not include titles or honorifics in the names. Do not assume a first name and last name if they don't belong to the recipient but return null values instead.\n\nOnly return the json data as a response.\n\nEmail content: ```Hello Alexander Schon,\n\nIt's been a while since we caught up! I was going through some old photos and stumbled upon that hilarious snap from our college camping trip. Ah, the memories!\n\nHow have you been? I heard you've moved to New York – that's fantastic. We should definitely meet up if I'm ever in the city or if you're visiting here.\n\nDrop me a message and let's plan something. Looking forward to hearing all your stories.\n\nTake care,\n\nSam```\n"

llama.cpp returns the expected json response. Not at all the case for the wrapper.

Solaxun commented 1 year ago

I'm only speaking to the consistency between the cpp lib and this one, which for my samples has been identical (so far). As far as getting the output to be exactly how you want? That's a problem for every LLM out there, and is a function of your prompt, the model, and the parameters.

I had previously included the argument echo=True, which returns your prompt in addition to the response. If you remove that (the default is False) you will get just the response. Here is what your example returned for me, using your exact prompt and no other text. You could argue that you don't want it to say "expected output", and maybe with enough prompt experimentation you could accomplish that - but still, pretty good as a first attempt.

[screenshot: JSON extraction output from llama-cpp-python for the email prompt]
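(For clarity, a minimal sketch of the echo behaviour mentioned above; the model path is a placeholder:)

from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-13b-chat.ggmlv3.q4_0.bin", verbose=False)

# echo=True returns the prompt plus the completion; the default (False)
# returns only the completion.
with_prompt = llm("what is galbitang?", max_tokens=64, temperature=0, echo=True)
completion_only = llm("what is galbitang?", max_tokens=64, temperature=0)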

We should probably continue this discussion in a chat of some sort to avoid spamming the maintainers. Feel free to ping me on Discord, same username (Solaxun).

rlleshi commented 1 year ago

Yeah, prompt engineering is another topic. I'm just saying that the responses I get amount to ramblings/garbage a lot of the time, regardless of the prompt used or the hyperparameters tried (the defaults, or maybe more conservative ones).

No, this response is quite good. I get gibberish output for this specific query with llama-cpp-python, and I have tested on two machines! Very odd.

But many thanks for your help.

lktinh2018 commented 1 year ago

Sorry all, I got the same issue with the OpenAI server Docker build.

Input:

{
  "messages": [
    { "content": "You are a helpful assistant.", "role": "system" },
    { "content": "Q: Do you know Apple company? A: ", "role": "user" }
  ]
}

Output:

{
  "id": "chatcmpl-0bbe41ad-a770-4fdf-a73d-d874a0a6dcca",
  "object": "chat.completion",
  "created": 1692689471,
  "model": "/models/llama-2-13b-chat.ggmlv3.q8_0.bin",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Yes sir/ girlfriendshipshipshipshipitalized italian ItalyFLAGging"
      },
      "finish_reason": "length"
    }
  ],
  "usage": { "prompt_tokens": 29, "completion_tokens": 16, "total_tokens": 45 }
}

Running it again, the output is:

{
  "id": "chatcmpl-fe4811fe-7ce3-4d7f-9d7b-06140a97a5b3",
  "object": "chat.completion",
  "created": 1692690218,
  "model": "/models/llama-2-13b-chat.ggmlv3.q8_0.bin",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " yes yes!ccoordinates geometryändersonally recommendations recommendations recommendation recommendations"
      },
      "finish_reason": "length"
    }
  ],
  "usage": { "prompt_tokens": 29, "completion_tokens": 16, "total_tokens": 45 }
}
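For anyone trying to reproduce this against the OpenAI-compatible server, a minimal sketch of the request (assuming the default llama_cpp.server port 8000 on localhost; adjust for your Docker port mapping):

import requests

# Same messages as in the report above; temperature pinned to 0 for determinism.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Q: Do you know Apple company? A: "},
        ],
        "temperature": 0,
    },
)
print(resp.json()["choices"][0]["message"]["content"])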