Closed Xan-Kun closed 1 month ago
a) Please be respectful.
b) Even the official openai/tiktoken repo only merged support for gpt-4o 3 days ago: https://github.com/openai/tiktoken/commit/9d01e5670ff50eb74cdb96406c7f3d9add0ae2f8. Integrated support will be coming here as well.
c) The tokenizer does support the new model; it's just not as integrated as for the other models, because the vocab file and associated regex aren't baked in. I expect @tarekgh would be able to share a sample.
Here is a sample showing how to create the gpt-4o tokenizer:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.RegularExpressions;
using Microsoft.ML.Tokenizers;

const string ENDOFTEXT = "<|endoftext|>";
const string ENDOFPROMPT = "<|endofprompt|>";

// Special token IDs for the o200k_base encoding used by gpt-4o
Dictionary<string, int> specialTokens = new()
{
    { ENDOFTEXT, 199999 },
    { ENDOFPROMPT, 200018 }
};

// Pre-tokenization regex for the o200k_base encoding
string regexPattern = @"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+";
Regex regex = new Regex(regexPattern, RegexOptions.Compiled);

// Download the o200k_base vocabulary file and build the tokenizer from it
HttpClient httpClient = new HttpClient();
Tiktoken tiktoken = await Tiktoken.CreateAsync(
    await httpClient.GetStreamAsync("https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"),
    specialTokens);
Tokenizer gpt4o = new Tokenizer(tiktoken, new TiktokenPreTokenizer(regex, specialTokens));

// Encode a string containing a special token and print the resulting IDs
gpt4o.EncodeToIds("Hello, World!<|endoftext|>").ToList().ForEach(Console.WriteLine);
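For reference, the o200k_base.tiktoken file downloaded above is a plain-text file with one entry per line: the token's bytes encoded in base64, a space, and the token's integer rank. A minimal Python sketch of how such a file is parsed (the two sample lines here are made up for illustration, not taken from the real vocabulary):

```python
import base64

# Hypothetical sample lines in the .tiktoken format: "<base64-encoded bytes> <rank>"
sample = "IQ== 0\nYWI= 1\n"  # made-up entries, not real o200k_base ranks

# Map each token's raw bytes to its BPE rank
ranks = {}
for line in sample.splitlines():
    token_b64, rank = line.split()
    ranks[base64.b64decode(token_b64)] = int(rank)

print(ranks)  # {b'!': 0, b'ab': 1}
```

This is why `Tiktoken.CreateAsync` only needs the raw stream plus the special-token map: the file itself carries no regex or special tokens, which is what the sample above has to supply by hand.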
Note: this sample uses the following library version:
<PackageReference Include="Microsoft.ML.Tokenizers" Version="0.22.0-preview.24179.1" />
I didn't do deep testing, as https://platform.openai.com/tokenizer doesn't support this new model yet.
CC @ericstj @luisquintanilla
This change is now published on NuGet: https://www.nuget.org/packages/Microsoft.ML.Tokenizers/0.22.0-preview.24271.1
System Information (please complete the following information):

Describe the bug
No way to tokenize gpt-4o strings!

To Reproduce
Tokenize a string for gpt-4o.

Expected behavior
The most recent models are supported.