Closed tarekgh closed 6 months ago
CC @stephentoub @michaelgsharp @ericstj @luisquintanilla
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 68.65%. Comparing base (
4d1a8c0
) to head (1c2dcea
).
This change is now published to the NuGet https://www.nuget.org/packages/Microsoft.ML.Tokenizers/0.22.0-preview.24271.1
Support the new OpenAI
Gpt-4o
tokenizer model.Notes
o200k_base.tiktoken
in the tokenizer assembly, just as we did with other Tiktoken models. After compression, the file size is1,188,794 bytes
. We plan to move the tokenizer vocabulary files into their own packages in the future.OpenAI didn't add tests yet for this new model. All tests added in this PR are manually generated. We can add more validation later when OpenAI enable this model in https://platform.openai.com/tokenizer.
Closes https://github.com/dotnet/machinelearning/issues/7154