LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Feature request: Option to disable auto-adding the BOS token (double BOS token) when it's already present. #917

Open Spacellary opened 2 weeks ago

Spacellary commented 2 weeks ago

How can this automatic behavior be disabled? And if it's not possible yet, can we get a --flag for it?

> llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token.

Running into this with Llama-3-8B models.

Related PR: https://github.com/ggerganov/llama.cpp/pull/7332
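
For context, here is roughly how the duplicate arises - a minimal sketch using llama-cpp-python, with "model.gguf" as a placeholder path for any Llama-3 GGUF whose metadata enables BOS insertion:

```python
# Minimal sketch of how the double BOS arises, using llama-cpp-python.
# "model.gguf" is a placeholder path; assumes a Llama-3 GGUF whose
# metadata sets tokenizer.ggml.add_bos_token = true.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", verbose=True)

# The templated prompt already starts with the literal BOS text...
prompt = "<|begin_of_text|>Hello"

# ...and tokenize() prepends the BOS id again (add_bos defaults to True),
# so the stream begins with two BOS tokens and llama.cpp logs the warning.
tokens = llm.tokenize(prompt.encode("utf-8"), add_bos=True, special=True)
print(tokens[:3])  # for Llama-3, something like [128000, 128000, ...]
```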

LostRuins commented 2 weeks ago

How are you getting this? KoboldCpp automatically adds a BOS token at the start of the prompt, you don't have to add your own.

Spacellary commented 2 weeks ago

The model was converted to GGUF using the original configs from its own repo:

https://huggingface.co/Sao10K/L3-8B-Stheno-v3.2

https://huggingface.co/Sao10K/L3-8B-Stheno-v3.2/blob/main/tokenizer_config.json#L2052
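
A quick way to confirm that the chat template itself already emits a BOS (a sketch assuming the transformers package and access to the Hugging Face Hub):

```python
# Sketch: check whether the repo's chat template prepends a BOS itself.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Sao10K/L3-8B-Stheno-v3.2")
text = tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}], tokenize=False
)
# True if the rendered prompt starts with <|begin_of_text|>, which then
# gets doubled when a backend tokenizes it with BOS insertion enabled.
print(text.startswith(tok.bos_token))
```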

Spacellary commented 2 weeks ago

So that models don't have to be reconverted: would it be possible to control/handle this behavior - the automatic addition of the bos_token - on the backend?

Manually removing the bos_token from the template in tokenizer_config.json does not help at all; the warning persists.

A similar situation exists in abetlen/llama-cpp-python.
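
If I understand the mechanics correctly, that's expected: llama.cpp decides whether to prepend a BOS from GGUF metadata (tokenizer.ggml.add_bos_token), not from the chat template, so editing tokenizer_config.json after conversion changes nothing. Here is a sketch for inspecting that flag with llama.cpp's gguf-py package ("model.gguf" is again a placeholder):

```python
# Sketch: inspect the GGUF flag that drives automatic BOS insertion.
from gguf import GGUFReader

reader = GGUFReader("model.gguf")
field = reader.fields.get("tokenizer.ggml.add_bos_token")
if field is None:
    print("flag absent; the loader falls back to a model-specific default")
else:
    # A ReaderField keeps its payload in parts; data[] indexes the value part.
    print("add_bos_token =", bool(field.parts[field.data[0]][0]))
```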

Spacellary commented 2 weeks ago

@LostRuins – I should clarify for transparency/investigation:

> How are you getting this? KoboldCpp automatically adds a BOS token at the start of the prompt, you don't have to add your own.

I'm using the latest release of KCPP as my inference backend on Windows 11 with CuBLAS, connected to SillyTavern, where I'm interacting with the model/character.

If you believe this should be handled upstream, please let me know; honestly, I'm not sure myself.

LostRuins commented 2 weeks ago

Hmm, okay, so the issue is: let's say I manually edit the prompt to remove the first BOS token when the user adds another one. What if they add 2 BOS tokens instead? Or what if they actually want to have 2, 3, or more BOS tokens? Changing the BOS behavior based on what they send in the prompt seems kind of finicky - either the backend should add a BOS automatically or it shouldn't at all; then the frontend can expect consistent behavior.

Fortunately, this doesn't actually seem to be an issue - having a double BOS in the prompt does not seem to negatively impact output quality at all, the first one is just ignored.
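
For illustration, the guard being debated would look something like this - a hypothetical sketch, not KoboldCpp's actual code (BOS_ID and add_bos_once are made-up names):

```python
# Hypothetical sketch of the dedup guard under discussion - not KoboldCpp code.
BOS_ID = 128000  # example: Llama-3's <|begin_of_text|> id

def add_bos_once(tokens: list[int]) -> list[int]:
    """Prepend BOS only if the prompt doesn't already start with one."""
    if tokens[:1] == [BOS_ID]:
        return tokens
    return [BOS_ID] + tokens
```

The objection is precisely that this guard silently rewrites the prompt for anyone who sends multiple BOS tokens on purpose.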

Spacellary commented 2 weeks ago

> What if they add 2 BOS tokens instead? Or what if they actually want to have 2, 3, or more BOS tokens?

This would be optional, of course. But I also agree that the outputs should be consistent for the sake of the frontends.

> Fortunately, this doesn't actually seem to be an issue - having a double BOS in the prompt does not seem to negatively impact output quality at all, the first one is just ignored.

I was wondering about that, since I didn't notice any issues other than the new warning that was added upstream - but it makes people think something is wrong.

From the user side, it looks like either the model or the backend is doing something incorrectly. Considering that the warning still persists even after manually changing the model's chat_template, I'm not sure.