Open Spacellary opened 2 weeks ago
How are you getting this? KoboldCpp automatically adds a BOS token at the start of the prompt, you don't have to add your own.
The model was converted to GGUF using the original configs from its own repo:
https://huggingface.co/Sao10K/L3-8B-Stheno-v3.2
https://huggingface.co/Sao10K/L3-8B-Stheno-v3.2/blob/main/tokenizer_config.json#L2052
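To illustrate the mechanism: the repo's chat template itself emits `bos_token`, so the rendered prompt already begins with BOS before the backend prepends its own. A minimal sketch (the template string below is an illustrative fragment, not the exact Stheno template):

```python
# Assumed example: a tokenizer_config.json whose Jinja chat template
# references bos_token, so rendering it produces a leading BOS on its own.
tokenizer_config = {
    "bos_token": "<|begin_of_text|>",
    "chat_template": "{{ bos_token }}{% for message in messages %}...{% endfor %}",
}

def template_emits_bos(cfg: dict) -> bool:
    # If the template mentions bos_token, the frontend-rendered prompt will
    # start with BOS, and a backend that auto-prepends BOS duplicates it.
    return "bos_token" in cfg.get("chat_template", "")

print(template_emits_bos(tokenizer_config))
```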
So that models don't have to be reconverted after manually removing the bos_token from the tokenizer_config.json template, would it be possible to control how the automatic addition of the bos_token is handled?
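A hypothetical opt-out could look like the sketch below. Note that `--no-add-bos` is an assumed flag name for illustration, not an existing KoboldCpp option:

```python
import argparse

# Hypothetical CLI flag sketch: let the user disable automatic BOS insertion
# so the backend trusts the prompt exactly as the frontend sent it.
parser = argparse.ArgumentParser()
parser.add_argument("--no-add-bos", action="store_true",
                    help="do not auto-prepend BOS; use the prompt as sent")

args = parser.parse_args(["--no-add-bos"])
add_bos = not args.no_add_bos  # argparse exposes the flag as no_add_bos
print(add_bos)
```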
This does not help at all.
@LostRuins – I should clarify for transparency/investigation:
How are you getting this? KoboldCpp automatically adds a BOS token at the start of the prompt, you don't have to add your own.
I'm using the latest KoboldCpp release as my inference backend on Windows 11 with CuBLAS, connected to SillyTavern, where I interact with the model/character.
If you believe this should be handled upstream, please let me know; honestly, I'm not sure myself.
Hmm, okay, so the issue is: let's say I manually edit the prompt to remove the first BOS token if the user adds another one. What if they add 2 BOS tokens instead? Or what if they actually want to have 2, 3, or more BOS tokens? Changing the BOS behavior based on what they send in the prompt seems kind of finicky - either the backend should add a BOS automatically or it shouldn't at all - then the frontend can expect consistent behavior.
Fortunately, this doesn't actually seem to be an issue - having a double BOS in the prompt does not seem to negatively impact output quality at all; the first one is just ignored.
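For concreteness, the "strip the user's duplicate" approach being debated could be sketched like this. `BOS_ID` and the policy are illustrative, not KoboldCpp's actual code:

```python
BOS_ID = 128000  # Llama-3's <|begin_of_text|> token id, used as an example

def prepend_bos(tokens: list[int], add_bos: bool = True) -> list[int]:
    # Policy sketch: ensure at most one leading BOS by collapsing any
    # user-supplied copies before optionally prepending the backend's own.
    # This is exactly the behavior objected to above: it makes it impossible
    # for a frontend to deliberately send 2, 3, or more BOS tokens.
    while tokens and tokens[0] == BOS_ID:
        tokens = tokens[1:]
    return [BOS_ID] + tokens if add_bos else tokens

print(prepend_bos([BOS_ID, BOS_ID, 1, 2]))  # collapses to a single BOS
```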
What if they add 2 BOS tokens instead? Or what if they actually want to have 2, 3, or more BOS tokens?
This would be optional, of course. But I also agree that the outputs should be consistent for the sake of the frontends.
Fortunately, this doesn't actually seem to be an issue - having a double BOS in the prompt does not seem to negatively impact output quality at all; the first one is just ignored.
I was wondering about that, since I didn't notice any issues other than the new warning that was added upstream - but it makes people think something is wrong.
From the user side, it looks like either the model or the backend is doing something incorrectly. Considering that the warning still persists even after manually changing the model's chat_template, I'm not sure.
How can this automatic behavior be disabled? And if it's not possible yet, can we get a --flag for it?
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token.
Running into this with Llama-3-8B models.
Related PR: https://github.com/ggerganov/llama.cpp/pull/7332
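For context, the upstream change roughly amounts to a check like the following. This is a Python sketch with illustrative names; the real code is C++ inside llama.cpp's tokenizer:

```python
BOS_ID = 128000  # Llama-3's <|begin_of_text|> token id, used as an example

def tokenize_with_bos(prompt_tokens: list[int],
                      add_special: bool = True) -> list[int]:
    # Sketch of the warn-but-don't-change behavior: BOS is still
    # auto-prepended, and a warning is emitted when the incoming prompt
    # already starts with one, leaving a double BOS in the final tokens.
    if add_special and prompt_tokens[:1] == [BOS_ID]:
        print("warning: added BOS token but the prompt already starts with BOS")
    return ([BOS_ID] if add_special else []) + prompt_tokens

print(tokenize_with_bos([BOS_ID, 1, 2]))  # double BOS is kept, only warned
```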