LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Special tokens are not rendered -> can't stop based on them #791

Open DreamGenX opened 5 months ago

DreamGenX commented 5 months ago

Hello!

When loading this model as GGUF (https://huggingface.co/LoneStriker/opus-v1.2-llama-3-8b-GGUF), the special tokens are not rendered for some reason, which breaks the "stop string" functionality.

Specifically, I am setting <|im_end|> as a stop string, but because the stop logic relies on string comparison and the corresponding token id is rendered as an empty string, generation never stops.

The <|im_end|> is tokenized correctly as token id 128009; I checked by inspecting the prompt token ids in debug mode. It's just not rendered.
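
For reference, a minimal sketch of the kind of request that fails to stop, assuming a local koboldcpp instance on port 5001 and its /api/v1/generate endpoint (the exact payload field names here are assumptions):

```python
# Sketch of a generate request that never stops on <|im_end|>, assuming a
# local koboldcpp instance at http://localhost:5001 and its /api/v1/generate
# endpoint (endpoint and field names are assumptions for illustration).
import requests

payload = {
    "prompt": "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n",
    "max_length": 200,
    # Matched by string comparison against the detokenized output, which
    # renders the special token as an empty string -- so it never matches.
    "stop_sequence": ["<|im_end|>"],
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])
```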

LostRuins commented 5 months ago

You raise a very good point. I shall add the ability to correctly handle a stop sequence as a special token if it's just a single special token when tokenized. That way, you will be able to add something like <|eot_id|> to stop_sequences and it will just work.
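
For illustration, one way that check could look (a minimal Python sketch of the idea, not koboldcpp's actual implementation; `tokenize()` and the special-token id set are hypothetical stand-ins):

```python
# Sketch: treat a stop sequence that tokenizes to a single special token as a
# token-id stop condition instead of a string match. tokenize() and
# SPECIAL_TOKEN_IDS are hypothetical stand-ins, not koboldcpp internals.
SPECIAL_TOKEN_IDS = {128000, 128009}  # e.g. <|begin_of_text|>, <|eot_id|>

def build_stop_conditions(stop_sequences, tokenize):
    stop_strings, stop_token_ids = [], set()
    for seq in stop_sequences:
        ids = tokenize(seq, parse_special=True)
        if len(ids) == 1 and ids[0] in SPECIAL_TOKEN_IDS:
            stop_token_ids.add(ids[0])   # stop on the token id directly
        else:
            stop_strings.append(seq)     # fall back to string matching
    return stop_strings, stop_token_ids

def should_stop(new_token_id, text_so_far, stop_strings, stop_token_ids):
    if new_token_id in stop_token_ids:
        return True
    return any(s in text_so_far for s in stop_strings)
```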

LostRuins commented 5 months ago

Should be fixed in the latest release.

DreamGenX commented 5 months ago

Thank you very much @LostRuins -- is there any reason not to also render the special tokens? That way you could e.g. use a stop sequence like <|im_start|>user, and that indeed works with backends that rely on HF tokenizers like vLLM or Aphrodite -- and IIRC it also works for koboldcpp and llama.cpp with some other models (e.g. my older Mistral 7B ChatML model).

LostRuins commented 5 months ago

Well, special tokens don't always have a string representation. Most of the time they're unwanted in the output, and piping them through would require clients to manually parse and strip them before displaying the content to the user. Also, the current behavior of upstream llama.cpp is to map all special tokens to the empty string when detokenizing. To be honest, I do wish they had used regular tokens instead.

DreamGenX commented 5 months ago

@LostRuins Most backends like vLLM etc. have this as an option (to render or not render special tokens). Rendering special tokens lets you properly parse the response, which is useful when the output is semi-structured à la ChatML, and is one of the reasons for having special tokens in the first place.

Regarding your side note, special tokens have some nice advantages over regular tokens: they always tokenize as a single unit, serve a single purpose, and cannot be produced from ordinary user input.
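
As a concrete illustration, once special tokens are rendered a client can recover the structure of a ChatML-style response with a simple parse (a sketch, assuming the raw text contains the <|im_start|>/<|im_end|> markers verbatim):

```python
# Sketch of parsing semi-structured ChatML output once special tokens are
# rendered in the response text (assumes the markers appear verbatim).
import re

raw = "<|im_start|>assistant\nSure, here you go.<|im_end|>\n<|im_start|>user"

def parse_chatml(text):
    messages = []
    for m in re.finditer(r"<\|im_start\|>(\w+)\n(.*?)(?:<\|im_end\|>|$)", text, re.S):
        role, content = m.group(1), m.group(2).strip()
        if content:
            messages.append({"role": role, "content": content})
    return messages

print(parse_chatml(raw))  # [{'role': 'assistant', 'content': 'Sure, here you go.'}]
```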

LostRuins commented 5 months ago

Hi, this should be fixed in the latest version; you can now pass render_special to the API to force special tokens to be output.
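
For example (a sketch against a local instance; the payload shape beyond the render_special flag itself is an assumption):

```python
# Sketch of requesting raw special tokens in the output via render_special,
# assuming a local koboldcpp instance at http://localhost:5001.
import requests

payload = {
    "prompt": "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n",
    "max_length": 200,
    "stop_sequence": ["<|im_end|>"],
    "render_special": True,  # ask the backend to emit special tokens verbatim
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(resp.json()["results"][0]["text"])  # may now contain <|im_end|> etc.
```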

DreamGenX commented 5 months ago

Awesome, thank you!