ahyatt / llm

A package abstracting llm capabilities for emacs.
GNU General Public License v3.0

Use tiktoken.el for token counting of OpenAI's models #14

Open zkry opened 9 months ago

zkry commented 9 months ago

Hello!

I noticed that one of the methods for the providers is llm-count-tokens which currently does a simple heuristic. I recently wrote a port of tiktoken that could add this functionality for at least the OpenAI models. The implementation in llm-openai.el would essentially look like the following:

(require 'tiktoken)
(cl-defmethod llm-count-tokens ((provider llm-openai) text)
  ;; Look up the tiktoken encoding for the provider's chat model and
  ;; count the tokens in TEXT with it.
  (let ((enc (tiktoken-encoding-for-model (llm-openai-chat-model provider))))
    (tiktoken-count-tokens enc text)))

There would be some design questions, such as whether this should use the chat-model or the embedding-model. One option would be to count with the embedding-model if it is set, otherwise the chat-model, falling back to some default.
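
As a rough sketch of that fallback (the accessor names and the default model string here are assumptions, not settled API):

(require 'tiktoken)
(cl-defmethod llm-count-tokens ((provider llm-openai) text)
  ;; Hypothetical fallback order: embedding model if set, then chat
  ;; model, then a guessed default model name.
  (let* ((model (or (llm-openai-embedding-model provider)
                    (llm-openai-chat-model provider)
                    "gpt-3.5-turbo"))
         (enc (tiktoken-encoding-for-model model)))
    (tiktoken-count-tokens enc text)))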

Definitely let me know your thoughts and I could have a PR up for it along with any other required work.

ahyatt commented 9 months ago

Very interesting, thanks for sharing this! Before we go further, do you have FSF copyright assignment already, or if not, are you willing to get it? Since this is part of GNU ELPA, all contributions must be from those who have assigned copyright to the FSF.

zkry commented 9 months ago

Yeah! I have the FSF copyright paperwork in so I should be good there.

ahyatt commented 9 months ago

Great, in that case, to use your encoder we could either put your library in ELPA (you would do this via the emacs-devel@ mailing list), which I can then depend on, or include your encoder in the llm library directly.

What's the difference in accuracy, do you think? Is it worth it to include this code?

As for embedding vs. chat: from what I understand, they use the same encoder, cl100k_base, so for OpenAI it shouldn't matter. My library also doesn't make a distinction between tokens for embeddings and chat. Of the two, chat makes the most sense to have token counting for, so it should probably be thought of as providing token counting for chat.
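
For example, here is an illustrative check using the tiktoken.el functions mentioned above (the model names are just examples of a chat model and an embedding model that should both map to cl100k_base):

;; Counting the same text with both encodings should give the same result.
(let ((chat-enc (tiktoken-encoding-for-model "gpt-3.5-turbo"))
      (embedding-enc (tiktoken-encoding-for-model "text-embedding-ada-002")))
  (list (tiktoken-count-tokens chat-enc "Hello, world!")
        (tiktoken-count-tokens embedding-enc "Hello, world!")))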

zkry commented 9 months ago

> What's the difference in accuracy, do you think? Is it worth it to include this code?

Good question. I tested tiktoken against two different heuristics (one just dividing the number of characters by 4) on a variety of code and text files, and these are the results I got:

[figure: comparison]

Zoomed in to the lower counts: [figure: comparison_zoomed]

And here are only the prose files (the outlier is non-ASCII text): [figure: prose]

It looks like both heuristics perform really well for English prose. For code, the (/ (buffer-size) 4.0) heuristic does seem to perform better.
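
(For reference, that heuristic amounts to something like the following; the function name is made up for illustration.)

(defun my-estimate-tokens-by-chars (&optional buffer)
  "Roughly estimate the token count of BUFFER as characters divided by 4."
  (with-current-buffer (or buffer (current-buffer))
    (/ (buffer-size) 4.0)))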

So if we were to go with (/ (buffer-size) 4.0), here is how far off it would be expected to be, as a percentage:

[figure: Figure_3]

So on average, it looks like it would be 10% off, while with the current heuristic it is on average 30% off.

So with all that said, I'm not sure how worthwhile it would be. I think the more advanced the use case, the more accuracy would be wanted. Also, non-ASCII text is noticeably further off with the heuristics. But the exact calculation isn't trivial, and (/ (buffer-size) 4.0) gets most of the way there... I'm not sure what would be best.

Let me know if you think including it would be best, and I can either add the code or put tiktoken.el on ELPA. Edit: maybe just adding the code to this repo would make the most sense, as tiktoken.el wouldn't really be useful as a standalone ELPA package.

ahyatt commented 9 months ago

Great analysis, thank you so much for that!

Let's keep this issue open - it might become critical in the future, but there are other things I need to do before I think we'd need this, namely:

1) Get max token counts per provider / operation (in progress).
2) Develop a prompting system that can flexibly get content up to the max tokens, in ways that make sense for different operations.

How precise things need to be is unclear, though - do we even want to approach the max? There are disadvantages to doing so, since it should (in theory, at least) decrease conversation quality, which also needs those tokens. If we had a rule like trying to get to 2/3 of the max (sketched below), we wouldn't need to be so precise with the token counting.
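
A trivial sketch of what such a budget rule could look like (the function name is hypothetical, not part of llm's API):

(defun my-llm-token-budget (max-tokens)
  "Return a conservative token budget: two thirds of MAX-TOKENS."
  (floor (* 2 max-tokens) 3))

;; e.g. (my-llm-token-budget 4096) => 2730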

Let's see where things take us. Thanks again for developing this library and reaching out about it.

zkry commented 9 months ago

Sounds good! I agree that those would be best to tackle first.