SciSharp / LLamaSharp

A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
https://scisharp.github.io/LLamaSharp
MIT License

[BUG]: Tokenization in 0.14.0 adds spaces #856

newsletternewsletter opened this issue 1 month ago

newsletternewsletter commented 1 month ago

Description

When tokenizing text and decoding the resulting tokens, one can see that tokenization now (as of version 0.14.0) prepends an extra space to the text on every call of Context.Tokenize(text, addBos, special). This is especially bad if a text is tokenized with more than one call. Version 0.13.0 did not exhibit this behavior; at least, it did not add spaces at the start of words, changing their token ids.
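A minimal sketch of the repro in LLamaSharp (the model path is a placeholder, and the exact overloads of LLamaWeights/CreateContext may differ slightly between versions):

```csharp
using LLama;
using LLama.Common;

var parameters = new ModelParams("gemma-1.1-2b-it-Q6_K.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);

// Tokenize the same fragment twice, as one might when assembling a prompt
// from several pieces; in 0.14.0 each call prepends an extra space,
// so the fragment no longer maps to the same token ids as in 0.13.0.
var tokens1 = context.Tokenize("user", addBos: false, special: false);
var tokens2 = context.Tokenize("user", addBos: false, special: false);
// Decoding tokens1 yields " user" (id 2425) instead of "user" (id 1645).
```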

This seems fine for most models (I saw it when using trollek/NinjaMouse-2.4B-32L-danube), but when I now use gemma-1.1-2b-it-Q6_K.gguf (from bartowski/gemma-1.1-2b-it-GGUF), it no longer works. The prompt was:

<start_of_turn>user
Who are you?<end_of_turn>
<start_of_turn>model

Validating with tokenize from llama.cpp b2985 (used in LlamaSharp Version 0.13.0):

     2 -> '<bos>'
   106 -> '<start_of_turn>'
  2425 -> ' user'
235286 -> '\'
235254 -> 'n'
  6571 -> 'Who'
   708 -> ' are'
   692 -> ' you'
235336 -> '?'
   107 -> '<end_of_turn>'
   730 -> ' \'
235254 -> 'n'
   106 -> '<start_of_turn>'
  2091 -> ' model'
235286 -> '\'
235254 -> 'n'

Interestingly, the token at position 2 (id 2425, ' user') carries a leading space, unlike the plain 'user' token (id 1645).

But even the latest llama.cpp b3412 does not work correctly; look at the token at position 1 (id 968, ' <'):

     2 -> '<bos>'
   968 -> ' <'
  2997 -> 'start'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
  1645 -> 'user'
   108 -> '
'
  6571 -> 'Who'
   708 -> ' are'
   692 -> ' you'
181537 -> '?<'
   615 -> 'end'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
   108 -> '
'
235322 -> '<'
  2997 -> 'start'
235298 -> '_'
   559 -> 'of'
235298 -> '_'
 15508 -> 'turn'
235313 -> '>'
  2516 -> 'model'
   108 -> '
'

Is there a way to completely prevent tokenization from adding extra spaces anywhere? I will tokenize the texts by hand if necessary. 😉

Reproduction Steps

Write the prompt (see above) to prompt.txt and run:

for llama.cpp b2985:

tokenize.exe "gemma-1.1-2b-it.Q6_K.gguf" "<start_of_turn>user\nWho are you?<end_of_turn>\n<start_of_turn>model\n"

or for llama.cpp b3412:

llama-tokenize.exe -m "gemma-1.1-2b-it.Q6_K.gguf" -f "prompt.txt"

Environment & Configuration

Known Workarounds

I would love to know!

martindevans commented 1 month ago

If you're seeing the wrong behaviour in llama-tokenize.exe, this looks like it's probably an upstream bug?

newsletternewsletter commented 1 month ago

If you're seeing the wrong behaviour in llama-tokenize.exe, this looks like it's probably an upstream bug?

Yes, indeed! I opened a bug ticket there: ggerganov/llama.cpp/issues/8584.

newsletternewsletter commented 1 month ago

Whether the tokenizer adds a space before the first non-special token can be controlled via the metadata key tokenizer.ggml.add_space_prefix. There are two workarounds (https://github.com/ggerganov/llama.cpp/issues/8584#issuecomment-2240032341):

  1. Using a KV override: tokenizer.ggml.add_space_prefix=bool:false.
  2. Changing the model's KV metadata: add this key to the GGUF file and set its value to false.

An acceptable workaround: changing the KV metadata in the GGUF file via a Python script works wonders (using a modified gguf-py/scripts/gguf_new_metadata.py from llama.cpp).
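For reference, workaround 1 can be tried at the upstream CLI level via llama.cpp's --override-kv option (the model filename and prompt are placeholders matching the example above):

```
# Force the add-space-prefix flag off at load time,
# without modifying the GGUF file itself.
./llama-cli -m gemma-1.1-2b-it-Q6_K.gguf \
    --override-kv tokenizer.ggml.add_space_prefix=bool:false \
    -p "<start_of_turn>user\nWho are you?<end_of_turn>\n<start_of_turn>model\n"
```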

However, trying a KV override via ModelParams.MetadataOverrides does not seem to work. When adding modelParams.MetadataOverrides.Add(new MetadataOverride("tokenizer.ggml.add_space_prefix", false)) before loading the model via LLamaWeights.LoadFromFileAsync, the KV override is ignored and the tokenizer adds a space.
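Spelled out, the failing override attempt looks roughly like this (the model path is a placeholder):

```csharp
using LLama;
using LLama.Common;

var modelParams = new ModelParams("gemma-1.1-2b-it-Q6_K.gguf");
// Attempt workaround 1 from managed code; in practice this override
// is ignored and the tokenizer still prepends a space.
modelParams.MetadataOverrides.Add(
    new MetadataOverride("tokenizer.ggml.add_space_prefix", false));
using var weights = await LLamaWeights.LoadFromFileAsync(modelParams);
```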

This is an upstream bug, as it is reproducible with llama-cli.

newsletternewsletter commented 1 month ago

It is being fixed upstream: https://github.com/ggerganov/llama.cpp/pull/8614

Oceania2018 commented 1 month ago

Gemma 2 2B is released; it's even surpassing GPT-3.5-turbo.

newsletternewsletter commented 1 month ago

However, trying a KV override via ModelParams.MetadataOverrides does not seem to work. When adding modelParams.MetadataOverrides.Add(new MetadataOverride("tokenizer.ggml.add_space_prefix", false)) before loading the model via LLamaWeights.LoadFromFileAsync, the KV override is ignored and the tokenizer adds a space.

I tried again with LLamaSharp 0.15.0, and although this has been fixed upstream (https://github.com/ggerganov/llama.cpp/pull/8614), the KV override via ModelParams.MetadataOverrides in LLamaSharp still does not work: neither with models that have tokenizer.ggml.add_space_prefix set to true (e.g. Lite-Mistral-150M-v2-Instruct), nor with ones that lack the tokenizer.ggml.add_space_prefix key entirely (many old quants).