Closed shibe2 closed 2 weeks ago
Option 1 might actually be a good idea
In particular, when special token identifiers are added to the prompt by client and passed to server (which is a proper way to handle added tokens), it inserts space into each piece of text between special tokens.
Unless there's a bug ( and I don't think there is, I tested it extensively, though it's not impossible ), it's the other way around.
Tokenizing with "special=false" will only add space at the begining of the entire input string, and special tokens will be naively tokenized as plaintext. That is in line with old tokenizer behaviour.
Tokenizing with "special=true" will not add any spaces, neither at the begining of the input string nor to the substrings between special tokens.
EDIT:
Option 1 would leave you with stray space if special tokens are enabled and add_bos is enabled.
Example:
<|im_start|>user\nHi.<|im_end|>\n
Current behavior with special=false, same as old behavior:
bos,<
,|
,im
,_
,start
,|
,>
,user
,endl,Hi
,.
,<
,|
,im
,_
,end
,|
,>
,endl
Current (new) behaviour with special=true:
bos,<|im_start|>
,user
,endl,Hi
,.
,<|im_end|>
,endl
Tying space prefix to add bos when special=true would leave you with:
bos,`,
<|im_start|>,
user,endl,
Hi,
.,
<|im_end|>`,endl
@staviq I just tested it.
request: {"prompt":[1,32001,"name\nmessage",32002,"\n",32001,"name\nmessage",32002,"\n",32001,"name\n"]}
tokenized: [1,32001,1024,13,4906,32002,29871,13,32001,1024,13,4906,32002,29871,13,32001,1024,13]
decoded: "<s><|im_start|> name\nmessage<|im_end|> \n<|im_start|> name\nmessage<|im_end|> \n<|im_start|> name\n"
As you can see, there are no spaces at all in the request, but tokenizer uses tokens that start with space. That is, it inserts space into each text fragment.
@staviq I just tested it.
request:
{"prompt":[1,32001,"name\nmessage",32002,"\n",32001,"name\nmessage",32002,"\n",32001,"name\n"]}
tokenized:[1,32001,1024,13,4906,32002,29871,13,32001,1024,13,4906,32002,29871,13,32001,1024,13]
decoded:"<s><|im_start|> name\nmessage<|im_end|> \n<|im_start|> name\nmessage<|im_end|> \n<|im_start|> name\n"
As you can see, there are no spaces at all in the request, but tokenizer uses tokens that start with space. That is, it inserts space into each text fragment.
Oh, I see, server
has it's own implementation of tokenize
:
I thought you mean new special token handling broke this, but you meant it didn't fix an already existing problem in server
, correct ?
Yes, the compulsory space is affecting server's behavior ever since it was introduced in #2810. Maybe we can call it a problem, but I see it as a missing feature. In many cases, inserting space is desirable, but it takes control over it away from the client.
If my option 1 is chosen, it may be that nothing needs to be changed in the server code. But it should be documented in which cases the client needs to add spaces.
If my option 2 is chosen, the server can be made to behave like in option 1 by default and have optional parameters that control insertion of spaces. Again, with appropriate documentation.
Spaces aren't opportunistically inserted within the text, so doing it only when add_bos==true
would make sense to me. (Option 1)
A little extra info: The OP links to a hack that prepends a "🙂". Pretty smart! I've been prepending a newline, but am not sure if it will reliably yield a unique newline token for all models in all cases. If tokenizers are really unpredictable, clients that edit the context should be retokenizing at arbitrary boundaries (I think this is called "token healing"), which is also complicated by space insertion.
@grencez I am also using newline (LF) for working around the space insertion. I'm curious to see a model that uses SentencePiece tokenizer and has multiple tokens that contain newline, which could make this usage of newline problematic. One of things that I also do is splitting text on token boundaries, so this will be important even when space insertion will be controllable.
I'm curious to see a model that uses SentencePiece tokenizer and has multiple tokens that contain newline
@shibe2 The CausalLM GGUF models on HuggingFace do this sometimes. " \n\n" becomes 2 tokens (" \n" and "\n"). Those quantizations are still experimental and seem buggy (e.g., llama_token_nl()
differs from how "\n" is tokenized), so don't jump to any conclusions just yet.
Given that, BOS and initial space no longer seem to be nicely coupled. Therefore, I prefer my option 2 – adding a new argument to tokenization functions.
Something definitely should happen with this. The current behavior destroys the quality of some models. See https://github.com/ggerganov/llama.cpp/pull/4081#issuecomment-1817529589
The way we token also clearly differs from the official Python tokenizing stuff. (Example in the comment I linked.)
It seems the discussion is happening in parallel, here and in #4081
I looks to me like a single llama_tokenize
function might not be enough to solve this, because it seems like for every possible solution so far, there is a mutually exclusive use case.
So my proposal is to add llama_tokenize_advanced
( or something similar ), and I think we might want to use single bitfiled parameter for "settings" ( ORable enum )
This way, we won't break llama.cpp bindings when adding function arguments ( we/I did accidentally break llama-cpp-python by adding special
before ), and we would be able to modify and add functionality to the tokenizer, without breaking compatibility in the future.
So my proposal is to add
llama_tokenize_advanced
How about doing that but make it take a struct with options? Sort of like how the quantize stuff has struct llama_model_quantize_params
. Then you can basically add an unlimited amount of options to it without making the parameter passing insane or breaking the API whenever a new option gets added.
I was about to say that, but I thought it would be to radical of a change
But this is a good idea, Vulkan for example uses that to solve it's problem of ungodly amount of parameters in function calls.
So my proposal is to add
llama_tokenize_advanced
( or something similar )
I added it as option 3. I think, after adding that function we would still need to decide what to do with the existing function, because its API is not in a good shape given current needs.
we would still need to decide what to do with the existing function
It could call the advanced function with the options for special and add_bos set appropriately. Possibly also deprecate it if it really just doesn't have a use case anymore. I guess if we were going to say that then we should probably just break the API and replace the existing function with the proposed advanced one.
Once HF supports add_prefix_space for SPM (https://github.com/ggerganov/llama.cpp/pull/3538#issuecomment-1811899891), YiTokenizer should no longer be needed upstream (AFAIK) and we can store the value of add_prefix_space in the GGUF for both SPM and GPT2 tokenizers.
Would anybody be averse to me implementing Option 1?
My code does incremental tokenization/decoding, and I'm getting weird spaces where they should not be appearing. I'd rather fix this now than work around it.
I believe the original problem discussed in this issue has been resolved:
Can you give a specific example and how you use llama.cpp
tokenization that does not produce the expected results?
@ggerganov
I saw
becomes:
INFO : 106070: LlamaCppService: Adding text: ```I saw```
FINE : 106071: ContextTokens: Added BOS token to context
INFO : 106071: LlamaCppService: Added 3 tokens: [
0: 1 = <s>
1: 306 = ▁I
2: 4446 = ▁saw
]
Changing this to I saw Sam
becomes:
INFO : 163750: LlamaCppService: Adding text: ``` Sam```
INFO : 163750: LlamaCppService: Added 2 tokens: [
0: 29871 = ▁
1: 3685 = ▁Sam
]
I expect Sam
to become just 3685 = ▁Sam
.
I could trim prefix spaces from the added partial text, or throw away the first 29871 token from the result. However, this behavior is surprising and complicates the client.
I think it would improve the API to not add a space in this case.
This tokenization matches what I get when I use the SentencePiece implementation via Python and LLaMA tokenizer:
from sentencepiece import SentencePieceProcessor
tokenizer = SentencePieceProcessor(dir_tokenizer + '/tokenizer.model')
print(tokenizer.encode('I saw', add_bos=True))
# [1, 306, 4446]
print(tokenizer.encode(' Sam'))
# [29871, 3685]
I think it is important to match the reference implementation API so the leading 29871
whitespace token is correct and expected in this case.
For example, in this Python example, how would you use the SentencePiece tokenizer to achieve your goal? If you can provide an example to do that, we could decide on approach to extend llama.cpp
tokenization API in a similar way so that it is consistent. But simply removing the whitespace is not an option because this simple example will no longer work correctly.
For example, in this Python example, how would you use the SentencePiece tokenizer to achieve your goal?
You have to set NormalizerSpec
's add_dummy_prefix
field as false. Conveniently, a recent SentencePiece commit added this in a test (https://github.com/google/sentencepiece/blob/de1747bbd4b4f35c1f0432851a9fcd4def61c0b0/python/test/sentencepiece_test.py#L851-L866). This syntax seems too cumbersome for llama.cpp's API.
I do think the feature is important though, especially when dealing with prompt formats where there's a special token followed by non-space characters. For example, a user's input in the ChatML format starts with a <|im_start|>
token, then a string "user" with no preceding space, then a newline, then the user's input text which also doesn't begin with a space. To me, the cleanest/fastest way to append the user's input text would involve a tokenize function that doesn't add a space prefix.
Here's a relevant issue on SentencePiece: https://github.com/google/sentencepiece/issues/282.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Not stale - spaces are still inserted by default with special=False, and this is undesirable when joining multiple tokenized texts.
This issue was closed because it has been inactive for 14 days since being marked as stale.
This issue was closed because it has been inactive for 14 days since being marked as stale.
I believe, this is still an issue. Although for my builds, I disable this space insertion altogether.
I recently ran into this as well. I'm working on a library to make integrating a local llama file into an ios app easier, and one feature is the idea that you can steer the assistant by force choosing an option, based on what it said so far. the way it works is that sample grabs the token, converts it to a string using llama_token_to_piece
, and then calls a callback that a developer can use to return what the next string of words that the assistant should say next - often just returning the sampled value. But I chose to have the developer return it as a string, as opposed to as a llama_token.
Since the whitespace was already unescaped from llama_token_to_piece
, i'd like to be able to pass the string value of back to llama_tokenize
in a way that won't add another preceding space.
There are tokens in the vocab that shouldn't have a space prepended to it. for example, the word "I'm" gets tokenized like so:
from sentencepiece import SentencePieceProcessor
tokenizer = SentencePieceProcessor(dir_tokenizer + '/tokenizer.model')
print(tokenizer.encode("I'm"))
# [306, 29915, 29885]
but if you tokenize each one character by character
print(tokenizer.encode("I"))
# [306]
print(tokenizer.encode("'"))
# [525]
print(tokenizer.encode("m"))
# [286]
you get different values. If you decode those tokens together, you get "I ' m" instead of "I'm".
print(tokenizer.decode([306, 525, 286]))
"I ' m"
Basically, it would be nice to be able to take the input string "m" and be able to choose 29885 (which does not have prepended whitespace), instead of 286 (which has the U+2581 space prepended).
For my issue, I can sorta work around it for sampled tokens, and possibly design the library around it, but I was hoping to not need to expose the library users to the intricacies of tokenization (part of the inspiration for the library :P ). Does that make sense for why an option to not prepend a space?
This issue was closed because it has been inactive for 14 days since being marked as stale.
Current Behavior
Since #2810, space is inserted into any non-empty text. This breaks multiple use cases:
Recently, space insertion was disabled in the case when text representation of special tokens is recognized: https://github.com/ggerganov/llama.cpp/blob/1a159553f921a9209fed8c714494e57b3649f232/llama.cpp#L6729 This works for toying with
main
example when escape processing is enabled, but leaves other scenarios broken. In particular, when special token identifiers are added to the prompt by client and passed toserver
(which is a proper way to handle added tokens), it inserts space into each piece of text between special tokens.Space insertion was made to match original Python implementation, but that behavior is itself not optimal, as evidenced by people having to hack around it.
Proposals
Option 1
Insert space only when BOS token is also inserted. There is a clear intersection between cases where BOS and space need to be inserted: when the whole prompt is one chunk of text. All broken cases that I listed here involve splitting and recombining the prompt.
The argument
bos
can be optionally renamed to something likefull_prompt
, or inversepartial
, with meaning that would encompass the behaviors controlled by it.Option 2
Add a separate argument that controls insertion of space. It would be used by /tokenize in server and in other places.
Option 3
Add another tokenization function with options that would control many aspects of the process. Suggested by @staviq.