dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
8.91k stars 1.86k forks source link

Still no o200k_base support #7154

Closed Xan-Kun closed 1 month ago

Xan-Kun commented 1 month ago

System Information (please complete the following information):

Describe the bug No way to tokenize gpt-4o strings!

To Reproduce Tokenize a string for gpt-4o

Expected behavior The most recent models are supported.

stephentoub commented 1 month ago

a) Please be respectful. b) Even the official openai/tiktoken repo only merged support for gpt-4o 3 days ago: https://github.com/openai/tiktoken/commit/9d01e5670ff50eb74cdb96406c7f3d9add0ae2f8. Integrated support will be coming here as well. c) Tokenizer does support the new model, it's just not as integrated as for the other models because the vocab file and associated regex aren't baked in. I expect @tarekgh would be able to share a sample.

tarekgh commented 1 month ago

Here is the sample how to create the gpt-4o.

const string ENDOFTEXT = "<|endoftext|>";
const string ENDOFPROMPT = "<|endofprompt|>";
Dictionary<string, int> specialTokens = new() 
{ 
    { ENDOFTEXT,   199999 },
    { ENDOFPROMPT, 200018 }
};
string regexPattern = @"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+";
Regex regex = new Regex(regexPattern, RegexOptions.Compiled);
HttpClient httpClient = new HttpClient();
Tiktoken tiktoken= await Tiktoken.CreateAsync(await httpClient.GetStreamAsync(@"https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"), specialTokens);
Tokenizer gpt4o = new Tokenizer(tiktoken, new TiktokenPreTokenizer(regex, specialTokens));

gpt4o.EncodeToIds("Hello, World!<|endoftext|>").ToList().ForEach(Console.WriteLine);

Note, this is using the library version:

    <PackageReference Include="Microsoft.ML.Tokenizers" Version="0.22.0-preview.24179.1" />

I didn't do deep testing as https://platform.openai.com/tokenizer didn't enable this new model.

tarekgh commented 1 month ago

CC @ericstj @luisquintanilla

tarekgh commented 1 month ago

This change is now published to the NuGet https://www.nuget.org/packages/Microsoft.ML.Tokenizers/0.22.0-preview.24271.1