dqbd / tiktokenizer

Online playground for OpenAI tokenizers
https://tiktokenizer.vercel.app
MIT License
707 stars, 88 forks

Can you add a function to `tiktoken` that automatically adds special characters to chat messages? #9

Closed Evertt closed 1 year ago

Evertt commented 1 year ago

I see that the text generated from the messages automatically gets the special tokens added to it, such as `<|im_start|>` and `<|im_end|>\n`, and that it always ends with `<|im_start|>assistant`.

That makes me wonder, when I'm trying to encode and count the tokens of an entire chat that I have using tiktoken, am I responsible for formatting my text in the correct way with the correct special tokens placed in the correct places?

Or is there a function that I can just give an array of messages, where each item has the following shape?

{ role: "user", content: "I need some help with MS Word!" }

So that tiktoken's encoder would then automatically add the right special tokens in the right places? I know it would be fairly trivial to write such a function myself, but if you'd be willing to add it to tiktoken, I'd have more trust that if OpenAI ever changes the special tokens or formatting they use internally, you'd pick up on that and update your package accordingly.

dqbd commented 1 year ago

Hi @Evertt

Yep! In the near future I would like to add a token counting feature directly into tiktoken, taking into account both the ChatML / message format and function / function_call counting as well.

Before that is done, though, https://github.com/dqbd/tiktokenizer/blob/1ec7a71142ae7c3ac968810c4977597021f29f4a/src/sections/ChatGPTEditor.tsx#L14-L31 should suffice if you're in a rush.
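
For anyone who wants to see roughly what such formatting involves, here is a minimal sketch based only on the layout described in the question: each message is wrapped in `<|im_start|>` and `<|im_end|>\n`, and the text ends with `<|im_start|>assistant`. The helper name `formatChatML`, the `ChatMessage` type, and the newline between role and content are assumptions made for illustration. This is not a copy of the linked code, and the exact separators may differ per model.

```typescript
// Hypothetical helper: name, type, and separators are illustrative assumptions,
// not the implementation linked above.
type ChatMessage = { role: string; content: string };

function formatChatML(messages: ChatMessage[]): string {
  const body = messages
    // Assumption: role and content are separated by a single newline.
    .map(({ role, content }) => `<|im_start|>${role}\n${content}<|im_end|>\n`)
    .join("");
  // The text ends with an open assistant turn, as described in the question.
  return `${body}<|im_start|>assistant`;
}

const text = formatChatML([
  { role: "user", content: "I need some help with MS Word!" },
]);
```

The resulting string can then be passed to the tokenizer with special tokens allowed, so that `<|im_start|>` and `<|im_end|>` are each encoded as a single special token rather than as plain text.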