dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Expose Encoder in TiktokenTokenizer #7313

Open razshare opened 6 days ago

razshare commented 6 days ago

Hello, first of all thank you very much for this project!

**Is your feature request related to a problem? Please describe.**

Yes, it is. Some of our clients may have outdated encodings in their client application. We still want those clients to have access to new encodings even if their application is not up to date, hence we want to serve the encoder dictionary from a server endpoint.

The problem is that, currently, the Encoder property in TiktokenTokenizer is internal.

https://github.com/dotnet/machinelearning/blob/509032755b912e7bb0dd50c10a3172ead57965f3/src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs#L998-L1001

**Describe the solution you'd like**

I would like to expose this Encoder property. There seems to be an intent to expose this property at some point in the future: https://github.com/dotnet/machinelearning/blob/509032755b912e7bb0dd50c10a3172ead57965f3/test/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs#L732-L740

Maybe this is the time to do it. What do you think?

**Describe alternatives you've considered**

Maybe a separate method that does exactly what the test above does using reflection. That sounds like overkill and a lot of overhead, though. Exposing the property is probably the best way to deal with this.

**Additional context**

I'm sending a PR your way with the changes; feel free to ask for or make any modifications you think are necessary.

tarekgh commented 3 days ago

@razshare could you please elaborate on why you need to expose it and how you are planning to use it? I read the description, but more details about your scenario will help here.

razshare commented 2 days ago

@tarekgh of course. I'll come back to you with a more in-depth explanation and possibly some drawings/schemas to make it easier to understand.

razshare commented 1 day ago

Hello @tarekgh , as promised, here's a more in-depth explanation.


LLM-based client applications often need to count the number of tokens in a given string. The reasons can be multiple:

  1. Sometimes it's necessary to limit the context sent to the LLM in order to reduce costs.
  2. Other times when dealing with a server it is necessary to avoid being rate limited, and so counting the tokens before sending them can help with that.

These are two examples I'm actively dealing with at the moment; I imagine there are other reasons too, which I have yet to encounter. The point is that counting tokens client-side is useful. Most of the time the client application can simply make use of the TiktokenTokenizer class itself.

In most cases, for the client to be able to count how many tokens a given string actually contains, they would simply invoke

var tokenizer = TiktokenTokenizer.CreateForModel(modelId);

to create a tokenizer, and then add the logic for counting the tokens in a string:

var numberOfTokens = tokenizer.CountTokens(myInputString);

However, due to constraints outside our control, the client application sometimes cannot stay fully up to date with the latest changes in TiktokenTokenizer.

In my case, I need to offer backward compatibility on the LLM side of things.

Some clients are not able to update their client application, which means their version of TiktokenTokenizer would become outdated pretty fast at the pace at which new models seem to come out.

If a client application has an out-of-date TiktokenTokenizer, it should still be able to interact with new models and count tokens locally by simply changing the model id in a configuration panel.

So with that in mind, there are some cases in which the first invocation will fail

var tokenizer = TiktokenTokenizer.CreateForModel("my-new-fancy-model-from-year-2030");

because the application, as built at the time, would not contain encodings for model my-new-fancy-model-from-year-2030.

All is not lost though, because TiktokenTokenizer.Create exists. https://github.com/dotnet/machinelearning/blob/509032755b912e7bb0dd50c10a3172ead57965f3/src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs#L1278-L1284

That method is agnostic to the model name/id; it just takes in the raw encoder dictionary.

This opens the door to a solution in which the server plays a role in solving this backward compatibility issue.


When the client is out of date and unaware of a specific model name/id, we fall back to the server, retrieve the encodings for that model, and finally create a new tokenizer directly from those encodings. We do this with something like this:

var buffer = UTF8Encoding.UTF8.GetBytes(base64Encodings);
var stream = new MemoryStream(buffer);
var tokenizer = TiktokenTokenizer.Create(stream);

> [!NOTE]
> The base64Encodings variable is the contents of the raw encoder (obtained from the server), encoded in base64, as required by TiktokenTokenizer.Create.

After that, the client can proceed to count tokens as usual.

> [!NOTE]
> And of course the client-side application may even cache these encodings locally, so that the next time it encounters a request for that new and shiny model it doesn't have to query the server and can use the cached encodings instead.
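To make the whole client-side flow concrete, here is a minimal sketch of how the fallback and local caching could fit together. The endpoint URL, the file-based cache, and the broad exception handling are made up for illustration; the Create call is used exactly as in the snippet above, with pre-tokenizer and special tokens still omitted.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.ML.Tokenizers;

// Minimal sketch of the client-side fallback described above.
// The endpoint URL and the file-based cache are hypothetical.
public static class TokenizerProvider
{
    private static readonly HttpClient Http = new();

    public static async Task<TiktokenTokenizer> GetTokenizerAsync(string modelId)
    {
        try
        {
            // Happy path: the locally shipped package already knows this model id.
            return TiktokenTokenizer.CreateForModel(modelId);
        }
        catch (Exception)
        {
            // Fallback: fetch the raw encoder for this model from our server,
            // cache it locally, and build the tokenizer from the stream instead.
            var cachePath = Path.Combine(Path.GetTempPath(), $"{modelId}.tiktoken");
            if (!File.Exists(cachePath))
            {
                var encodings = await Http.GetStringAsync($"https://example.com/encodings/{modelId}");
                await File.WriteAllTextAsync(cachePath, encodings);
            }

            using var stream = File.OpenRead(cachePath);

            // Pre-tokenizer and special tokens are still omitted here,
            // exactly as in the snippet above (see the discussion below).
            return TiktokenTokenizer.Create(stream);
        }
    }
}
```

The client would then count tokens with CountTokens exactly as in the happy path.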

Currently, retrieving these raw encodings on the server side can only be done through reflection, as shown in the original test file. https://github.com/dotnet/machinelearning/blob/509032755b912e7bb0dd50c10a3172ead57965f3/test/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs#L732-L740
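For reference, here is roughly what that reflection workaround looks like on the server side, assuming (as the referenced test does) that the internal property is named Encoder and is an IReadOnlyDictionary<ReadOnlyMemory<byte>, int>; the serialization into base64-token/rank lines is just one way to ship it to clients.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Text;
using Microsoft.ML.Tokenizers;

// Reflection workaround sketch: pull the internal Encoder dictionary out of a
// TiktokenTokenizer and serialize it as "<base64 token> <rank>" lines so the
// server can hand it to clients. Property name and type are assumed from the test.
public static class EncoderDump
{
    public static string DumpEncoder(TiktokenTokenizer tokenizer)
    {
        PropertyInfo? property = typeof(TiktokenTokenizer).GetProperty(
            "Encoder", BindingFlags.Instance | BindingFlags.NonPublic);

        if (property?.GetValue(tokenizer) is not IReadOnlyDictionary<ReadOnlyMemory<byte>, int> encoder)
        {
            throw new InvalidOperationException("Internal 'Encoder' property not found or its shape changed.");
        }

        var builder = new StringBuilder();
        foreach (KeyValuePair<ReadOnlyMemory<byte>, int> pair in encoder)
        {
            builder.Append(Convert.ToBase64String(pair.Key.Span)).Append(' ').Append(pair.Value).Append('\n');
        }

        return builder.ToString();
    }
}
```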

This is probably fine for you, the authors.

But we're not the authors, and accessing internal properties this way doesn't guarantee that they won't change one day without a major version bump. Basically, it's dangerous for us, as users of the library, to do this kind of thing.

On top of that, it's reflection, so it has a performance impact as well.

> [!NOTE]
> Although it's probably very small.

Hence the solution: make the encoder public.
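To show what that would buy us, here is a sketch of the server side once the property is public. The minimal-API endpoint, the route, and the line format are illustrative assumptions; the point is simply that no reflection is needed anymore.

```csharp
using System;
using System.Linq;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.ML.Tokenizers;

// Hypothetical server endpoint sketch, assuming Encoder is exposed publicly with
// the same shape as today's internal property. The route and line format are made up.
var app = WebApplication.CreateBuilder(args).Build();

app.MapGet("/encodings/{modelId}", (string modelId) =>
{
    var tokenizer = TiktokenTokenizer.CreateForModel(modelId);

    // With a public Encoder, serializing the vocabulary needs no reflection.
    var lines = tokenizer.Encoder.Select(pair =>
        $"{Convert.ToBase64String(pair.Key.Span)} {pair.Value}");

    return Results.Text(string.Join('\n', lines));
});

app.Run();
```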

Let me know if this clarifies the reason for this change and the problem it's actually trying to solve, and whether you think there's a better, more ergonomic, more future-proof approach.

tarekgh commented 8 hours ago

@razshare thanks a lot for the details. It is super helpful. One follow-up question: is the server always in control of the source of the tokenizer data? I mean, can the server always create the tokenizer using the stream (instead of calling CreateForModel)? If you can do that, it will be simpler for the server to just stream the content to the client without any processing (that is, it would avoid getting the encoder, encoding it as UTF-8 base64, and sending it to the client).

By the way, I am not objecting to your proposal; I am just brainstorming how to support the scenario in an efficient way. If we need to go with your proposal, we may think about exposing a tokenizer Create method that allows taking the Encoder data too. Also, calling the Create method passing the stream only will not be enough, as you need to pass the pre-tokenizer and special tokens too.
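For illustration, such a Create-from-encoder overload could be approximated today by re-serializing the dictionary into the vocab stream format and deferring to the existing stream-based Create. The helper below is purely hypothetical, and the exact parameter list of Create (pre-tokenizer, normalizer, special tokens) is assumed from the source linked earlier.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.ML.Tokenizers;

// Hypothetical helper approximating a "Create from encoder data" overload.
// It writes the dictionary back out in the vocab format ("<base64 token> <rank>"
// per line) and defers to the existing stream-based Create.
public static class TiktokenFactory
{
    public static TiktokenTokenizer CreateFromEncoder(
        IReadOnlyDictionary<ReadOnlyMemory<byte>, int> encoder,
        PreTokenizer? preTokenizer = null,
        IReadOnlyDictionary<string, int>? specialTokens = null)
    {
        var builder = new StringBuilder();
        foreach (var pair in encoder)
        {
            builder.Append(Convert.ToBase64String(pair.Key.Span)).Append(' ').Append(pair.Value).Append('\n');
        }

        using var stream = new MemoryStream(Encoding.UTF8.GetBytes(builder.ToString()));

        // The parameter order (vocab stream, pre-tokenizer, normalizer, special tokens)
        // is assumed to match the Create overload linked earlier in the thread.
        return TiktokenTokenizer.Create(stream, preTokenizer, null, specialTokens);
    }
}
```

A first-class overload taking the dictionary directly would avoid this round trip through a string and a stream.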

razshare commented 5 hours ago

Hello again @tarekgh !

> is the server always in control of the source of the tokenizer data? I mean, can the server always create the tokenizer using the stream (instead of calling CreateForModel)?

In my case, the server itself never creates the actual tokenizer instance; there is no invocation of TiktokenTokenizer.CreateForModel() or TiktokenTokenizer.Create() on the server. Only the client calls TiktokenTokenizer.CreateForModel() with a model id. If that fails, it tries to retrieve the encodings for that model from the server and then calls TiktokenTokenizer.Create() instead.

> If you can do that, it will be simpler for the server to just stream the content to the client without any processing

If by content you mean the encoder dictionary, then yes, that's exactly it. And yes, the processing part (encoding to base64) can also be skipped on the server.

> you need to pass the pre-tokenizer and special tokens too.

Yeah, I left those out for the time being in order to focus on the encoder specifically, and also because we haven't wrestled with that part so far; we've just been omitting those parameters for the sake of simplicity, to get the architecture working and solve the backward compatibility issue.

The pre-tokenizer and special tokens will come later; for the moment I'm aiming to allow the client to successfully create a tokenizer from a remote encoder dictionary.

> we may think about exposing a tokenizer Create method that allows taking the Encoder data too

Without converting it to a stream? It's not strictly necessary, but that sounds like a good quality-of-life improvement to me. The .NET touch!