abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Llama3 instruct prompt template missing BOS token #1537

Open pmbaumgartner opened 3 months ago

pmbaumgartner commented 3 months ago


Expected Behavior

Llama 3 models using the prompt template in llama-cpp-python > 0.2.77 are missing the BOS token, which degrades model output quality.

For example, the Meta doc on Llama 3 has prompt template examples here: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

Current Behavior

Result of calling format_llama3({}) with no messages, after from llama_cpp.llama_chat_format import format_llama3:

0.2.78:

ChatFormatterResponse(prompt='<|start_header_id|>assistant<|end_header_id|>\n\n', stop='<|eot_id|>', stopping_criteria=None, added_special=False)

0.2.77:

ChatFormatterResponse(prompt='<|begin_of_text|><|start_header_id|>assistant<|end_header_id|>\n\n', stop='<|eot_id|>', stopping_criteria=None)
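For reference, the outputs above come from a call like the following; running it under each version shows the missing <|begin_of_text|> prefix in 0.2.78:

from llama_cpp.llama_chat_format import format_llama3

# Prints the ChatFormatterResponse produced by the installed version.
print(format_llama3(messages=[]))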

Environment and Context

22.6.0 Darwin Kernel Version 22.6.0: Mon Apr 22 20:49:37 PDT 2024; root:xnu-8796.141.3.705.2~1/RELEASE_ARM64_T6000 arm64
Python 3.10.13
GNU Make 3.81
$ g++ --version

Failure Information (for bugs)


Steps to Reproduce

Install versions 0.2.77 and 0.2.78 and test anything with a Llama 3 model and llm.create_chat_completion. Alternatively, import the formatter directly and compare its output between versions.

I am using https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/tree/main as a model for testing, particularly the Q5_K_M quantization, but this should affect all Llama 3 models.
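A minimal reproduction sketch, assuming the bartowski GGUF above (the exact filename is an assumption; any Llama 3 Instruct GGUF should behave the same, and from_pretrained requires huggingface_hub):

from llama_cpp import Llama

# Assumed repo/filename for the Q5_K_M quantization linked above.
llm = Llama.from_pretrained(
    repo_id="bartowski/Meta-Llama-3-8B-Instruct-GGUF",
    filename="Meta-Llama-3-8B-Instruct-Q5_K_M.gguf",
    chat_format="llama-3",
    n_ctx=2048,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me the best recipe for banana pudding."}]
)
print(out["choices"][0]["message"]["content"])

Running this under 0.2.77 and then 0.2.78 makes it easy to compare the responses.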

CISC commented 3 months ago

This was necessary to avoid double BOS tokens; may I ask why you are calling format_llama3 directly?

The LlamaChatCompletionHandler that is called if you just use the chat_format parameter will handle this automatically: https://github.com/abetlen/llama-cpp-python/blob/027f7bc67890f1de801407fbbb608c182e2ad286/llama_cpp/llama_chat_format.py#L552

If you really need to call the ChatFormatters directly, I suggest you check the new added_special property to see whether the tokenizer should be called with add_bos=True or not.
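A rough sketch of that check (the model path is a placeholder; the tokenize call mirrors what the handler does):

from llama_cpp import Llama
from llama_cpp.llama_chat_format import format_llama3

llm = Llama(model_path="./Meta-Llama-3-8B-Instruct-Q5_K_M.gguf")  # placeholder path

result = format_llama3(
    messages=[{"role": "user", "content": "Hello"}]
)

# If the formatter already embedded the special tokens, skip the extra BOS;
# otherwise let the tokenizer prepend it.
tokens = llm.tokenize(
    result.prompt.encode("utf-8"),
    add_bos=not result.added_special,
    special=True,
)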

CISC commented 3 months ago

Also, there is no need to use chat_format on models that have the correct chat_template embedded in their metadata; it will automatically be used for chat completion.
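In other words, something like this should be enough (the model path is a placeholder for a local Llama 3 GGUF):

from llama_cpp import Llama

# No chat_format argument: the chat_template stored in the GGUF metadata is picked up.
llm = Llama(model_path="./Meta-Llama-3-8B-Instruct-Q5_K_M.gguf")
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me the best recipe for banana pudding."}]
)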

pmbaumgartner commented 3 months ago

Are there tests that verify the prompts sent to the model are the same after this change? I'm getting different results with the same model in these two versions.

pmbaumgartner commented 3 months ago

Or should I assume the different results are because the prior prompt had a double BOS token?

pmbaumgartner commented 3 months ago

This was necessary to avoid double BOS tokens; may I ask why you are calling format_llama3 directly?

I am trying to use instruction-tuned models with Outlines. Outlines currently doesn't support chat completion, so I'm manually filling in the prompt template from a messages object with format_llama3, like so:

from llama_cpp.llama_chat_format import format_llama3

prompt = format_llama3(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful cooking assistant.",
        },
        {"role": "user", "content": "Give me the best recipe for banana pudding."},
    ]
).prompt

Using format_llama3 was just the easiest way to access the necessary chat template, fill in the data, and get the result back as a string that I can pass to the Outlines generator. If there's another easy way to fill in the template and get the resulting string, that's what I'm looking for.

CISC commented 3 months ago

Or should I assume the different results are because the prior prompt had a double BOS token?

Yes, that is exactly what happened after tokenization in previous versions. As far as I can tell this is also the case with Outlines: https://github.com/outlines-dev/outlines/blob/3a7d83b89afcf6a3ecd53b134bf226c5041d674d/outlines/models/llamacpp.py#L57-L66

Using format_llama3 was just the easiest way to access the necessary chat template, fill in the data, and get the result back as a string that I can pass to the Outlines generator. If there's another easy way to fill in the template and get the resulting string, that's what I'm looking for.

You should be fine; Outlines adds the BOS token at tokenization, so it should now be generating the correct prompt, as opposed to the double BOS you would have gotten with previous versions.
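A quick way to confirm this (a sketch with a placeholder model path) is to tokenize the formatted prompt roughly the way the completion handler does and count the BOS tokens:

from llama_cpp import Llama
from llama_cpp.llama_chat_format import format_llama3

llm = Llama(model_path="./Meta-Llama-3-8B-Instruct-Q5_K_M.gguf")  # placeholder path

prompt = format_llama3(
    messages=[{"role": "user", "content": "Hello"}]
).prompt

tokens = llm.tokenize(prompt.encode("utf-8"), add_bos=True, special=True)
print(tokens.count(llm.token_bos()))  # 1 on >= 0.2.78; 2 (double BOS) on <= 0.2.77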