Hi @Evertt
Yep! In the near future I would like to add a token counting feature directly into tiktoken, taking into account both the ChatML / message format and function / function_call counting.
Before that is done, though, https://github.com/dqbd/tiktokenizer/blob/1ec7a71142ae7c3ac968810c4977597021f29f4a/src/sections/ChatGPTEditor.tsx#L14-L31 should suffice if you're in a rush.
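For reference, here is a rough TypeScript sketch of that kind of per-message counting, modeled loosely on the linked snippet and the OpenAI cookbook's token-counting recipe. The overhead constants are assumptions for the chat models current at the time and may drift if OpenAI changes the format:

```ts
import { encoding_for_model } from "tiktoken";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
  name?: string; // optional author name
}

function countChatTokens(messages: ChatMessage[]): number {
  const enc = encoding_for_model("gpt-3.5-turbo");

  // Per-message overhead for "<|im_start|>{role}\n{content}<|im_end|>\n".
  // These constants follow the OpenAI cookbook recipe; treat them as
  // assumptions that can change between model versions.
  const tokensPerMessage = 3;
  const tokensPerName = 1;

  let total = 0;
  for (const msg of messages) {
    total += tokensPerMessage;
    total += enc.encode(msg.role).length;
    total += enc.encode(msg.content).length;
    if (msg.name) total += enc.encode(msg.name).length + tokensPerName;
  }

  // Every reply is primed with "<|im_start|>assistant".
  total += 3;

  enc.free(); // the WASM encoder must be freed explicitly
  return total;
}
```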
I see that the text that is generated from the messages automatically gets the special tokens added to it, such as `<|im_start|>` and `<|im_end|>\n`, and it even always ends with `<|im_start|>assistant`.
That makes me wonder: when I'm trying to encode and count the tokens of an entire chat using `tiktoken`, am I responsible for formatting my text with the correct special tokens placed in the correct places? Or is there a function where I can just pass an array of messages, where each item has the following shape?
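Presumably something along the lines of the standard chat-completion message:

```ts
// Assumed shape, mirroring OpenAI's chat-completion messages.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
  name?: string; // optional author name
}
```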
And would `tiktoken`'s encoder then automatically add the right special tokens in the right places? I know it would be fairly trivial to write such a function myself, but if you'd be willing to add it to `tiktoken`, then I'd have more confidence that, if OpenAI ever changes anything about the special tokens / formatting they use internally, you'd pick up on it and update your package accordingly.
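For context, such a helper could look roughly like the sketch below; the template is only a guess based on the tokens described above, not a confirmed rendering of the format OpenAI uses internally:

```ts
// Hypothetical helper (not an existing tiktoken API): render messages into
// a ChatML-style string with the special tokens placed as described above.
function toChatML(messages: Message[]): string {
  const rendered = messages
    .map((m) => `<|im_start|>${m.name ?? m.role}\n${m.content}<|im_end|>\n`)
    .join("");
  // The rendered prompt always ends with the assistant turn opener.
  return rendered + "<|im_start|>assistant\n";
}
```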