LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Add proper tokenization endpoints to extras API. #548

Closed TaleirOfDeynai closed 8 months ago

TaleirOfDeynai commented 9 months ago

I just saw in a recent commit that you took the existing api/extra/tokencount API endpoint and had it include the token IDs.

If you're doing this, now might be a good time to implement proper tokenization endpoints that enable both encoding and decoding, allowing third-party frontends to do their own intelligent budgeting and context construction.

My recommendations for such an API are pretty simple:

- api/extra/token/encode - receives { prompt: string } and responds with { tokens: number[] }.
- api/extra/token/decode - receives { tokens: number[] } and responds with { prompt: string }.
- api/extra/token/mapping - receives { prompt: string } and responds with { token_mappings: Array<[string, number]> }.
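
As a rough sketch of how a frontend would call the first two (hypothetical, of course, since these endpoints don't exist yet; the paths and response shapes are just the proposal above):

```ts
// Hypothetical client helpers for the proposed encode/decode endpoints.
// Nothing here exists in koboldcpp yet; the shapes are the ones suggested above.
async function encode(baseUrl: string, prompt: string): Promise<number[]> {
  const res = await fetch(`${baseUrl}/api/extra/token/encode`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  const { tokens } = await res.json();
  return tokens;
}

async function decode(baseUrl: string, tokens: number[]): Promise<string> {
  const res = await fetch(`${baseUrl}/api/extra/token/decode`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ tokens }),
  });
  const { prompt } = await res.json();
  return prompt;
}
```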

The last one might seem redundant, but for some advanced prompt construction tasks, it helps to know how tokenization breaks up words. For instance, the text fragment " bane-seekers to arms!" might map into:

{
  "token_mappings": [
    [" b", 275],
    ["ane", 1531],
    ["-", 12],
    ["seekers", 47971],
    [" to", 284],
    [" arms", 5101],
    ["!", 0]
  ]
}

If I'm doing word-based token trimming of the prompt and want to trim off everything after "bane-seekers", I need to know where that word ends, and this saves having to hit the token/decode endpoint at least 4 times to translate individual tokens into their decoded string fragments. Doing this in a single API call is far more efficient.

On the JavaScript side, you can also just feed the token_mappings array straight into the Map constructor and build a local dictionary/cache of these relevant tokens, which is also very handy and can cut down on how many API requests ultimately need to be made.
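
For illustration, something like this (a sketch against the proposed mapping shape; trimAfter is just a hypothetical helper):

```ts
// Sketch: working with the proposed token/mapping response on the client side.
// TokenMapping matches the token_mappings shape suggested above; none of this exists yet.
type TokenMapping = Array<[string, number]>;

// Array<[string, number]> is exactly what Map's constructor accepts,
// so the mapping doubles as a local fragment -> token-ID cache.
function buildTokenCache(tokenMappings: TokenMapping): Map<string, number> {
  return new Map(tokenMappings);
}

// Hypothetical helper: keep everything up to and including a given fragment,
// e.g. trimAfter(mapping, "seekers") for the example above.
function trimAfter(tokenMappings: TokenMapping, lastFragment: string): number[] {
  const end = tokenMappings.findIndex(([fragment]) => fragment === lastFragment);
  if (end === -1) return tokenMappings.map(([, id]) => id);
  return tokenMappings.slice(0, end + 1).map(([, id]) => id);
}
```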

Finally, for API consistency, you could introduce an aliased api/extra/token/count endpoint and deprecate the old api/extra/tokencount endpoint, slating it for removal in some future V2 API. Having a simple endpoint that only tells you the number of tokens is still desirable.

LostRuins commented 9 months ago

This is not as straightforward as it seems, because of multibyte encodings.

Something like "I love 马铃薯", for example, converts to the tokens [' I' (315), ' love' (2016), ' ' (28705), '马' (30259), '<0xE9>' (236), '<0x93>' (150), '<0x83>' (134), '<0xE8>' (235), '<0x96>' (153), '<0xAF>' (178)]. The characters 铃 and 薯 each get split into raw UTF-8 byte tokens, and many of the resultant bytes are not meaningful on their own; only when concatenated do they represent an extended unicode character. This also applies to other unicode constructs like Emojis.
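
To illustrate what I mean (just a sketch, not how koboldcpp handles it internally): you have to collect the whole run of byte tokens before decoding, otherwise you get replacement characters.

```ts
// Sketch: the byte-level tokens for "铃薯" only decode cleanly once concatenated.
const byteTokens = [0xe9, 0x93, 0x83, 0xe8, 0x96, 0xaf];

// Decoding a single byte on its own produces U+FFFD (the replacement character):
console.log(new TextDecoder("utf-8").decode(new Uint8Array([0xe9]))); // "�"

// Concatenating all six bytes first gives the intended characters:
console.log(new TextDecoder("utf-8").decode(new Uint8Array(byteTokens))); // "铃薯"
```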

I could make a detokenize endpoint though, the one where you send an array of ints and get back a single string.

LostRuins commented 9 months ago

Can you give me a good use case where someone would need to detokenize something from an array of IDs? I can't imagine it would really make much sense.

TaleirOfDeynai commented 9 months ago

Sorry for the lateness of my reply. I'm not getting email notifications from GitHub for some reason.

I already said what my use-case was: to do token-aware string processing. But I'll elaborate.

In one of my projects, I re-implemented NovelAI's context builder, with various enhancements, as a userscript; it does a lot of work splitting strings apart on different kinds of boundaries (paragraphs, sentences, space-separated words) and then updating the token representations as they're arranged together into a context of finite size. The goal was to maximize utilization of the available context space while favoring the most relevant information for the AI. I intend to reuse portions of that project in a custom kobold.cpp frontend.

Unfortunately, NAI's token encoder was quite slow. I found that if I was careful, I could encode the full string once and do a great deal of processing using the much, MUCH faster token decoder instead, only tapping the encoder with smaller fragments as needed to limit how much time I spent waiting on it. Even though this involved a clumsy binary search using the decoder, it still doubled the speed of my context builder (this is honestly why I want that mapping endpoint: so I can do this search with only one API call).
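
The idea was roughly this (a simplified sketch, not my actual code; decodeTokens stands in for whatever decode call is available):

```ts
// Sketch: find how many leading tokens cover a given character boundary,
// using only the (fast) decoder, via binary search over the prefix length.
// decodeTokens is a stand-in for whatever decode endpoint/function exists.
async function tokensUpTo(
  tokens: number[],
  charBoundary: number,
  decodeTokens: (ids: number[]) => Promise<string>,
): Promise<number[]> {
  let lo = 0;
  let hi = tokens.length;
  while (lo < hi) {
    // Bias upward so the loop always makes progress when lo is feasible.
    const mid = Math.ceil((lo + hi) / 2);
    const text = await decodeTokens(tokens.slice(0, mid));
    if (text.length <= charBoundary) lo = mid;
    else hi = mid - 1;
  }
  // lo is now the largest prefix whose decoded text fits within the boundary.
  return tokens.slice(0, lo);
}
```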

There are issues like the one you described to look out for, but I could just degrade to splitting the string normally and re-encoding the two fragments using the slower encoder when problems like that crop up. They never did; none of the splitting options I made available would try to rip a single character into multiple tokens, but I still put in a sanity check to detect unexpected changes to individual characters just in case.

Maybe kobold/llama.cpp has a super fast encoder and this isn't really a concern, though. I haven't yet benchmarked it.

But, another good use is just to have a tool that can show how strings are going to be tokenized. The GGUF format carries a lot of information on the functionality of the tokenizer (like its declared special tokens) and instead of using some online tool that probably doesn't actually reflect the model I'm using, I could just interrogate the tokenizer of the actual, loaded model.

And finally, if you're going to expose an endpoint that encodes a string into tokens, it is kind of baffling to not provide a means to reverse that process. If you're worried about extended unicode characters being mutilated, then put a warning in the docs about it. You gave a very good explanation of the problem you're concerned about.

LostRuins commented 8 months ago

Yes, string-to-token-IDs is already added to /tokencount, at the request of the SillyTavern devs.
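
So from a frontend it looks roughly like this (a sketch; treat the exact response field names as an assumption and check the code/docs for the current shape):

```ts
// Sketch of querying the existing tokencount endpoint for token IDs.
// The response field names ("value", "ids") are an assumption here; verify
// against the actual koboldcpp source before relying on them.
// (Assumes an ESM/top-level-await context and the default port 5001.)
const res = await fetch("http://localhost:5001/api/extra/tokencount", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ prompt: "I love 马铃薯" }),
});
const data = await res.json();
console.log(data.value); // total token count
console.log(data.ids);   // the token IDs themselves
```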

Adding an endpoint for the reverse, where you upload an array of IDs, parse it, and obtain the string representation, is quite a bit of effort and not really useful for most koboldcpp users, because the generate endpoint only accepts string input as the prompt anyway.

TaleirOfDeynai commented 8 months ago

Well, I can proceed without it, since I bet ggml's tokenizer is an order of magnitude faster than the JavaScript tokenizer NAI had embedded into its frontend as a background worker. Maybe it's not as big a deal to squeeze the budget as hard as I was, now that we have 8K and greater contexts. I started exploring context construction back when AI Dungeon's scripting system only gave ~700 tokens to play with, so budget optimization was VERY important. I think I can probably loosen up a bit. :stuck_out_tongue:

Just keep in mind that the trend is heading towards token-first APIs. There is already pressure to start combining image tokens and string tokens into a single prompt for multi-modal stuff. Heck, we'll also probably want embeddings for long-term memory systems utilizing vector databases.

What I guess I'm saying is: don't kick the can TOO far down the road. We're pretty limited in the kinds of frontends we can build with a bare-bones "give string; receive completion" API. I really like kobold.cpp and its wide support for both new and old models, so that's why I wanna make it a target for my future frontend.