botisan-ai / gpt3-tokenizer

Isomorphic JavaScript/TypeScript Tokenizer for GPT-3 and Codex Models by OpenAI.
MIT License
171 stars · 19 forks

[Issue] Calculate Tokens size? #15

Open rk-teche opened 1 year ago

rk-teche commented 1 year ago

The token count is not accurate if we compare it with the GPT-3 tokenizer.

Any help would be appreciated. Thanks

evilDave commented 1 year ago

Do you have an example that (still) does not work? The token count is identical for every text I have checked.

lhr0909 commented 1 year ago

@rk-teche thank you for your feedback! There could be a discrepancy with the current OpenAI models, especially when compared with the token counts from the API outputs. I am going to spend some time trying to move token calculation to OpenAI's own tiktoken inside my package, as part of the v2 work.

Aldo111 commented 1 year ago

Hi, I found one case where this package doesn't seem to count newlines properly, while the GPT tokenizer adds 2 tokens per newline.

E.g. for "Hello\n\n" this package returns 2 tokens, but the online GPT tokenizer returns 5. Does the package trim the text or something?

[Screenshot: online tokenizer showing 5 tokens for the entered text]

kitfit-dave commented 1 year ago

I think what you will find is that the online tokenizer does not recognise \n as a newline (it sees the two characters \ and n). Just put in two hard newlines and you will get 2 tokens. Also, look at the token ids for your entered string: [15496, 59, 77, 59, 77], where 59 is \ and 77 is n. Alternatively, test gpt3-tokenizer with the string 'Hello\\n\\n' and it will come out as 5 tokens.
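The distinction above can be checked in plain Node.js, without any tokenizer: the JavaScript escape sequence \n is one newline character, while \\n is two characters (a backslash and the letter n), which is what you effectively type into the online tokenizer when you enter \n literally.

```javascript
// The escape sequence "\n" is ONE newline character; "\\n" is TWO
// characters: a backslash followed by the letter n.
const escaped = "Hello\n\n";   // Hello + two real newlines
const literal = "Hello\\n\\n"; // Hello + the 4 characters \ n \ n

console.log(escaped.length); // 7 (5 letters + 2 newlines)
console.log(literal.length); // 9 (5 letters + 4 characters)

// Character codes make the difference visible:
console.log([...escaped].map((c) => c.charCodeAt(0)));
// [72, 101, 108, 108, 111, 10, 10]            -- 10 is "\n"
console.log([...literal].map((c) => c.charCodeAt(0)));
// [72, 101, 108, 108, 111, 92, 110, 92, 110]  -- 92 is "\", 110 is "n"
```

So the two inputs really are different strings, and different token counts for them are expected rather than a bug.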

Aldo111 commented 1 year ago

Yep, I'm aware of \ + n being counted as separate characters, since the tokenizer screenshot above showed it clearly. Given your last example, would the most appropriate approach be to escape the string (or special characters) before passing it to the tokenizer?

Alternatively, what I've ended up doing is treating the tokenizer output as an estimate rather than an exact count (which also generally makes sense given the documentation and long-term model differences) and following the "Counting Tokens" deep-dive guide (for gpt-3.5+) in the OpenAI docs. Combining gpt3-tokenizer with the estimates they've provided in the docs is super helpful and brings the results closer to the real counts.
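For anyone curious what that estimation approach looks like, here is a minimal sketch. It assumes the per-message overheads described in OpenAI's token-counting guidance for gpt-3.5-turbo-style chat models (roughly 3 tokens of framing per message plus 3 tokens priming the reply); `countTokens` is a hypothetical stand-in using the rough 4-characters-per-token heuristic, and you would swap in gpt3-tokenizer's `encode()` (or tiktoken) for real per-string counts.

```javascript
// Crude stand-in for a real tokenizer: ~4 characters per token on average.
// Replace with gpt3-tokenizer / tiktoken for accurate counts.
function countTokens(text) {
  return Math.ceil(text.length / 4); // heuristic, NOT exact
}

// Estimate the prompt size of a chat request, adding the per-message
// framing overhead described in OpenAI's token-counting guidance.
function estimateChatTokens(messages) {
  const TOKENS_PER_MESSAGE = 3; // role/content framing per message (assumed)
  const REPLY_PRIMING = 3;      // tokens priming the assistant's reply (assumed)
  let total = REPLY_PRIMING;
  for (const { role, content } of messages) {
    total += TOKENS_PER_MESSAGE + countTokens(role) + countTokens(content);
  }
  return total;
}

const estimate = estimateChatTokens([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Hello\n\n" },
]);
console.log(estimate); // an estimate, not an exact billing count
```

Treating this as a ceiling-ish estimate rather than a fact is exactly the mindset described above: the overhead constants can change between model versions, so only the API response's `usage` field is authoritative.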

kitfit-dave commented 1 year ago

For passing to the tokeniser, you should escape in the regular JavaScript way, so "Hello" followed by two newlines is "Hello\n\n". Is that not giving you 2 tokens? Or are you saying you think the answer should be 5? The online tokeniser your screenshot is from does not accept escape sequences, only literal characters; if you want a newline there, you should type a newline. Only then are you comparing apples to apples.

I've been commenting on these issues where folks say "it's an estimate" or "it's not correct" because I switched to this library precisely because it seems to be exactly correct. Clearly a lot of work went into this project to make it so, and I'd like everyone to benefit from that, knowing the results are accurate.