abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Differing outputs for llama.cpp and llama-cpp-python on gemma-2b-it q4 #1235

Closed: zhiyong9654 closed this issue 7 months ago

zhiyong9654 commented 8 months ago

I'm facing reproducibility issues between llama.cpp and llama-cpp-python on the same quantized model from lmstudio-ai.

Here's a reproducible example:

from transformers import AutoTokenizer
from llama_cpp import Llama
import os

HUGGINGFACE_TOKEN = os.environ.get("HUGGINGFACE_TOKEN")

# Run huggingface-cli download lmstudio-ai/gemma-2b-it-GGUF gemma-2b-it-q4_k_m.gguf --local-dir . --local-dir-use-symlinks False
llm_path = "./gemma-2b-it-q4_k_m.gguf"

text = """Extract the dishes sold in a python list:'The tofu prawn sauce was amazing, the Samsui chicken was still as good as I remembered it. Service was friendly and efficient.', 'It have been ages since we last dine here. Service is prompt and all dishes today is nice and if you order their samsui chicken, you can order another side dishes at $3.20. 3 pax total up of $76 only. Quite worth it. Overall is an pleasant dining experience.'"""

chat = [{"role": "user", "content": text}]
tokenizer = AutoTokenizer.from_pretrained("gg-hf/gemma-2b-it", token=HUGGINGFACE_TOKEN)
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

llm = Llama(model_path=llm_path)
print(prompt)
print(llm(prompt))

Returns:

{
  'id': 'cmpl-157668be-8ec5-40cc-8c0c-7655cd1f46f5', 
  'object': 'text_completion', 
  'created': 1709217359, 
  'model': './gemma-2b-it-q4_k_m.gguf', 
  'choices': [{'text': 'The description:\n\nThe reviews are very friendly and efficient and delicious.\n\nThe', 
  'index': 0, 
  'logprobs': None, 
  'finish_reason': 'length'}], 
  'usage': {'prompt_tokens': 109, 'completion_tokens': 16, 'total_tokens': 125}
}

Using llama.cpp on the same model yields more sensible results:

 ./main -m ../../google-maps-api/gemma-2b-it-q4_k_m.gguf --prompt "<bos><start_of_turn>user
Extract the dishes sold in a python list:'The tofu prawn sauce was amazing, the Samsui chicken was still as good as I remembered it. Service was friendly and efficient.', 'It have been ages since we last dine here. Service is prompt and all dishes today is nice and if you order their samsui chicken, you can order another side dishes at $3.20. 3 pax total up of $76 only. Quite worth it. Overall is an pleasant dining experience.'<end_of_turn>
<start_of_turn>model"

# Outputs:
Sure, here is a list of dishes from the text:

- Tofu prawn sauce
- Samsui chicken
- Side dishes [end of text]

Versions:

llama_cpp_python==0.2.51
Python 3.10.13
Both llama.cpp and llama_cpp_python running in WSL2
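
(As a quick sanity check, the binding version actually loaded can be printed from Python; this is only a convenience snippet, assuming a standard install where llama_cpp exposes __version__:)

import llama_cpp
print(llama_cpp.__version__)  # expected to print 0.2.51 in this setup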

What I've tried:

  1. I tried a simpler prompt: "What's the capital of France":
    1. llama.cpp: The capital city of France is Paris. It is the political, economic and cultural center of the country. Paris is also the home of many famous landmarks such as the Eiffel Tower, the Louvre Museum, and Notre Dame Cathedral.
    2. llama-cpp-python: 'The capital of France is Paris. It is the capital city of France. It'
    3. This result is much closer, which seems to indicate that the model is loaded correctly and something else is the issue.
  2. I tried playing around with temperature and top_p; neither helped reconcile the two outputs, as discussed here.
  3. I thought it might be different default parameters, so I cross-checked llama-cpp-python's docs and llama.cpp's README. The only differences I could spot were:
    1. llama.cpp's max_tokens=128 and top_p=0.9
    2. llama-cpp-python max_tokens=16 and top_p=0.95
    3. Changing these also didn't help (a sketch of setting them explicitly follows this list).
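
For reference, a minimal sketch of passing the sampling settings explicitly to llama-cpp-python. The max_tokens and top_p values mirror the llama.cpp defaults listed above; the temperature, top_k, and repeat_penalty values are assumptions added only to make the sampling configuration explicit:

# Sketch: call the already-loaded model with explicit sampling parameters so the
# settings match the llama.cpp run as closely as possible.
output = llm(
    prompt,
    max_tokens=128,      # llama.cpp default reported above
    top_p=0.9,           # llama.cpp default reported above
    temperature=0.8,     # assumed value, shown only for explicitness
    top_k=40,            # assumed value, shown only for explicitness
    repeat_penalty=1.1,  # assumed value, shown only for explicitness
)
print(output["choices"][0]["text"])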
zhiyong9654 commented 7 months ago

Fixed after updating llama_cpp_python to 0.2.55
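
(For anyone hitting the same mismatch, the upgrade is a one-line pip install; the version pin below just reflects the fix reported above:)

pip install --upgrade "llama-cpp-python>=0.2.55"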