dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
8.91k stars, 1.86k forks

More tokenizer's APIs cleanup #7110

Closed: tarekgh closed 2 months ago

tarekgh commented 2 months ago

This change includes the following:

tarekgh commented 2 months ago

CC @ericstj @michaelgsharp

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 91.36364%, with 19 lines in your changes missing coverage. Please review.

Project coverage is 68.49%. Comparing base (4d5317e) to head (780857b). Report is 5 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7110      +/-   ##
==========================================
+ Coverage   68.44%   68.49%   +0.04%
==========================================
  Files        1263     1262       -1
  Lines      254834   255089     +255
  Branches    26334    26358      +24
==========================================
+ Hits       174429   174731     +302
+ Misses      73695    73653      -42
+ Partials     6710     6705       -5
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.49% <91.36%> (+0.04%)` | :arrow_up: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `62.88% <81.00%> (+0.04%)` | :arrow_up: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.60% <100.00%> (+0.02%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [.../Microsoft.ML.Tokenizers/Model/SentencePieceBpe.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FSentencePieceBpe.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1NlbnRlbmNlUGllY2VCcGUuY3M=) | `70.66% <ø> (+1.22%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FTiktoken.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1Rpa3Rva2VuLmNz) | `70.37% <ø> (+1.08%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Tokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuaXplci5jcw==) | `80.10% <100.00%> (+1.95%)` | :arrow_up: |
| [test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FBpeTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9CcGVUZXN0cy5jcw==) | `91.30% <100.00%> (+4.36%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Model/Model.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FModel.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL01vZGVsLmNz) | `16.17% <50.00%> (+0.79%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Model/BPE.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FBPE.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0JQRS5jcw==) | `71.21% <89.79%> (+9.29%)` | :arrow_up: |
| [...icrosoft.ML.Tokenizers/Utils/ValueStringBuilder.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FUtils%2FValueStringBuilder.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL1ZhbHVlU3RyaW5nQnVpbGRlci5jcw==) | `42.24% <74.41%> (+10.62%)` | :arrow_up: |

... and [10 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7110/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)