abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Switch to disable adding BOS token #1561

Open etemiz opened 2 months ago

etemiz commented 2 months ago

Is your feature request related to a problem? Please describe.
I am building the prompt myself and calling

llm.create_completion(prompt, max_tokens=max_tokens,
                               temperature=temperature, repeat_penalty=repeat_penalty,
                               stop=stops)

llama.cpp is telling me it is adding yet another BOS token at the beginning, which could affect performance:

RuntimeWarning: Detected duplicate leading "<bos>" in prompt, this will likely reduce response quality, consider removing it

Describe the solution you'd like
Either llama.cpp should not add a BOS token at the beginning, or there should be a switch to disable it.
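In the meantime, a possible workaround is to tokenize the prompt myself with add_bos=False and pass the token ids to create_completion. This is only a sketch, assuming the current llama-cpp-python API (Llama.tokenize takes add_bos/special flags and create_completion accepts a list of token ids as the prompt); the model filename is a placeholder:

from llama_cpp import Llama

llm = Llama(model_path="gemma-2-9b-it.Q4_K_M.gguf")  # placeholder model file

prompt = (
    "<bos><start_of_turn>user\n"
    "You are a helpful chat bot, answering questions.\n"
    "<end_of_turn><start_of_turn>model\n"
    "OK<end_of_turn><start_of_turn>user\n"
    "What kind of questions can I ask you?<end_of_turn><start_of_turn>model\n"
)

# add_bos=False keeps llama.cpp from prepending a second <bos>;
# special=True lets "<bos>"/"<start_of_turn>" map to their special token ids.
tokens = llm.tokenize(prompt.encode("utf-8"), add_bos=False, special=True)

# create_completion also accepts a list of token ids instead of a string prompt.
out = llm.create_completion(tokens, max_tokens=256, temperature=0.7,
                            repeat_penalty=1.1, stop=["<end_of_turn>"])
print(out["choices"][0]["text"])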

Additional context

This is a prompt with the gemma2 template that I pass to the create_completion function:

<bos><start_of_turn>user
You are a helpful chat bot, answering questions.
<end_of_turn><start_of_turn>model
OK<end_of_turn><start_of_turn>user
What kind of questions can I ask you?<end_of_turn><start_of_turn>model
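For reference, the duplication is easy to see by tokenizing such a prompt with the defaults and inspecting the leading tokens. A minimal sketch, assuming Llama.tokenize and Llama.token_bos behave as in current llama-cpp-python (model filename is again a placeholder):

from llama_cpp import Llama

llm = Llama(model_path="gemma-2-9b-it.Q4_K_M.gguf")  # placeholder model file
prompt = "<bos><start_of_turn>user\nHello<end_of_turn><start_of_turn>model\n"

# With the default add_bos=True, the explicit "<bos>" in the text plus the
# automatically prepended BOS produce two identical leading token ids.
tokens = llm.tokenize(prompt.encode("utf-8"), add_bos=True, special=True)
print(tokens[:3], llm.token_bos())  # expect tokens[0] == tokens[1] == llm.token_bos()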
Ph0rk0z commented 2 months ago

Even more fun for qwen2, which doesn't have a BOS token and may have some random token set by the quanter that will degrade output.
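One way to sanity-check what a given GGUF actually declares is to look at its tokenizer metadata and BOS id. A sketch, assuming Llama exposes the GGUF metadata as a metadata dict and that the tokenizer.ggml.add_bos_token key is present in the file (it may not be):

from llama_cpp import Llama

llm = Llama(model_path="qwen2-7b-instruct.Q4_K_M.gguf")  # placeholder model file

# GGUF metadata records whether the tokenizer is supposed to prepend BOS at all;
# a BOS-less model should report "false" or omit the key entirely.
print(llm.metadata.get("tokenizer.ggml.add_bos_token"))

# The BOS id baked into the file; an unexpected id hints it was set by the quanter
# rather than coming from the original tokenizer.
print(llm.token_bos())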