dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Support Gpt-4o tokenizer model #7157

Closed: tarekgh closed this 6 months ago

tarekgh commented 6 months ago

Support the new OpenAI Gpt-4o tokenizer model.

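        // Create the Tiktoken tokenizer for the gpt-4o model.
        Tokenizer GPT4o = Tokenizer.CreateTiktokenForModel("gpt-4o");
        string text = "<|endoftext|>Hello ⭐ World<|endofprompt|>";

        // Encode the text to token IDs, or just count the tokens.
        IReadOnlyList<int> encoded = GPT4o.EncodeToIds(text);
        int idsCount = GPT4o.CountTokens(text);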

Notes

Closes https://github.com/dotnet/machinelearning/issues/7154

tarekgh commented 6 months ago

CC @stephentoub @michaelgsharp @ericstj @luisquintanilla

codecov[bot] commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 68.65%. Comparing base (4d1a8c0) to head (1c2dcea).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7157      +/-   ##
==========================================
- Coverage   68.92%   68.65%   -0.27%
==========================================
  Files        1396     1262     -134
  Lines      266745   257767    -8978
  Branches    27560    26660     -900
==========================================
- Hits       183849   176982    -6867
+ Misses      75792    73972    -1820
+ Partials     7104     6813     -291
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7157/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7157/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.65% <100.00%> (-0.27%)` | :arrow_down: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7157/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `62.94% <100.00%> (-0.59%)` | :arrow_down: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7157/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.85% <100.00%> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7157?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7157?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FTiktoken.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1Rpa3Rva2VuLmNz) | `75.84% <100.00%> (+0.14%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Tokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7157?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuaXplci5jcw==) | `56.98% <100.00%> (+0.94%)` | :arrow_up: |
| [test/Microsoft.ML.Tokenizers.Tests/TitokenTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7157?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FTitokenTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9UaXRva2VuVGVzdHMuY3M=) | `99.55% <100.00%> (+0.03%)` | :arrow_up: |

... and [138 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7157/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)

tarekgh commented 6 months ago

This change is now published to NuGet: https://www.nuget.org/packages/Microsoft.ML.Tokenizers/0.22.0-preview.24271.1