pmbaumgartner opened this issue 3 months ago
This was necessary to avoid double BOS tokens, may I ask why you are calling format_llama3 directly?
The LlamaChatCompletionHandler that is called if you just use the chat_format parameter will handle this automatically:
https://github.com/abetlen/llama-cpp-python/blob/027f7bc67890f1de801407fbbb608c182e2ad286/llama_cpp/llama_chat_format.py#L552
If you really need to call ChatFormatters directly, I suggest you check the newly added added_special property to see whether the tokenizer should be called with add_bos=True or not.
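For illustration, a minimal sketch of that pattern (the model path is a placeholder, and it assumes the ChatFormatterResponse returned by format_llama3 carries the added_special field mentioned above):

from llama_cpp import Llama
from llama_cpp.llama_chat_format import format_llama3

llm = Llama(model_path="./Meta-Llama-3-8B-Instruct-Q5_K_M.gguf")  # placeholder path

result = format_llama3(
    messages=[{"role": "user", "content": "Hello"}]
)

# If the formatter already embedded the BOS token in the prompt text,
# don't let the tokenizer add a second one.
tokens = llm.tokenize(
    result.prompt.encode("utf-8"),
    add_bos=not result.added_special,
    special=True,
)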
Also, there is no need to use chat_format on models that have the correct chat_template embedded; it will automatically be used for chat completion.
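For example, something like this should work without chat_format as long as the GGUF has the Llama 3 chat_template in its metadata (the model path is a placeholder):

from llama_cpp import Llama

# No chat_format argument: the chat_template stored in the GGUF metadata is used.
llm = Llama(model_path="./Meta-Llama-3-8B-Instruct-Q5_K_M.gguf")

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful cooking assistant."},
        {"role": "user", "content": "Give me the best recipe for banana pudding."},
    ]
)
print(response["choices"][0]["message"]["content"])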
Are there tests that verify the resulting prompts that are sent to the model are the same after this change? I'm getting different results with the same model in these two versions.
Or should I assume the different results are because the prior prompt had a double BOS token?
This was necessary to avoid double BOS tokens, may I ask why you are calling format_llama3 directly?
I am trying to use instruction-tuned models with Outlines. Currently they don't support chat completion, so I'm manually filling in the prompt template with a messages object and format_llama3, like so:
from llama_cpp.llama_chat_format import format_llama3

# Fill in the Llama 3 chat template and keep only the rendered prompt string.
prompt = format_llama3(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful cooking assistant.",
        },
        {"role": "user", "content": "Give me the best recipe for banana pudding."},
    ]
).prompt
Using format_llama3 was just the easiest way to access the necessary chat template, fill in the data, and get the result as a string that I can pass to the generator for Outlines. If there's another way to easily complete the template and get the resulting string, that's what I'm looking for.
Or should I assume the different results are because the prior prompt had a double BOS token?
Yes, this is exactly what happened after tokenization in previous versions. As far as I can tell this is also the case with Outlines:
https://github.com/outlines-dev/outlines/blob/3a7d83b89afcf6a3ecd53b134bf226c5041d674d/outlines/models/llamacpp.py#L57-L66
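Roughly what was happening before, as a sketch (the prompt string approximates the old formatter output, and the model path is a placeholder):

from llama_cpp import Llama

llm = Llama(model_path="./Meta-Llama-3-8B-Instruct-Q5_K_M.gguf")  # placeholder path

# In <= 0.2.77 the formatted prompt already started with the BOS text, so
# tokenizing it again with add_bos=True produced two BOS tokens.
old_prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
tokens = llm.tokenize(old_prompt.encode("utf-8"), add_bos=True, special=True)
# tokens[0] and tokens[1] would both be the BOS id in that case.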
Using format_llama3 was just the easiest way to access the necessary chat template, fill in the data, and get the result as a string that I can pass to the generator for Outlines. If there's another way to easily complete the template and get the resulting string, that's what I'm looking for.
You should be fine; Outlines will add BOS at tokenization, so it should be generating the correct prompt now, as opposed to the double BOS you would have gotten with previous versions.
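So an end-to-end flow would look roughly like this (a sketch only; the repo/filename form of outlines.models.llamacpp and the generate.text API are assumptions that may vary between Outlines versions):

import outlines
from llama_cpp.llama_chat_format import format_llama3

model = outlines.models.llamacpp(
    "bartowski/Meta-Llama-3-8B-Instruct-GGUF",
    "Meta-Llama-3-8B-Instruct-Q5_K_M.gguf",
)

# No leading <|begin_of_text|> in this prompt; Outlines adds BOS when it tokenizes.
prompt = format_llama3(
    messages=[
        {"role": "system", "content": "You are a helpful cooking assistant."},
        {"role": "user", "content": "Give me the best recipe for banana pudding."},
    ]
).prompt

generator = outlines.generate.text(model)
answer = generator(prompt)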
Expected Behavior
Llama 3 models using the prompt template in llama-cpp-python > 0.2.77 are missing the BOS token, and model quality is degraded as a result.
For example, the Meta documentation on Llama 3 has prompt template examples here: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
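For reference, the instruct format on that page starts with the BOS token, roughly:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>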
Current Behavior
Result of an empty format_llama3({}) call (after from llama_cpp.llama_chat_format import format_llama3):
0.2.78:
0.2.77:
Steps to Reproduce
Install versions 0.2.77 and 0.2.78 and test anything with a Llama 3 model and llm.create_chat_completion. In addition, import the templates and check the difference. I am using https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/tree/main as a model for testing, particularly the Q5_K_M quantization, but this should affect all models.
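A quick way to see the template difference directly (run under each version and compare the output):

from llama_cpp.llama_chat_format import format_llama3

# Print the raw prompt produced by the formatter and check whether the BOS text is present.
resp = format_llama3(messages=[{"role": "user", "content": "Hi"}])
print(repr(resp.prompt))
print("starts with <|begin_of_text|>:", resp.prompt.startswith("<|begin_of_text|>"))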