dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Adding needed Tokenizer's APIs #7047

Closed tarekgh closed 3 months ago

tarekgh commented 3 months ago

Fixes https://github.com/dotnet/machinelearning/issues/7043

This change adds the following Tokenizer APIs:

tarekgh commented 3 months ago

CC @michaelgsharp @LittleLittleCloud @luisquintanilla

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 85.09804%, with 38 lines in your changes missing coverage. Please review.

Project coverage is 68.82%. Comparing base (164fde0) to head (1d89506). Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7047      +/-   ##
==========================================
+ Coverage   68.81%   68.82%   +0.01%
==========================================
  Files        1255     1255
  Lines      250248   250358     +110
  Branches    25533    25550      +17
==========================================
+ Hits       172197   172304     +107
+ Misses      71442    71441       -1
- Partials     6609     6613       +4
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.82% <85.09%> (+0.01%)` | :arrow_up: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.25% <79.78%> (+<0.01%)` | :arrow_up: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.51% <100.00%> (+0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [src/Microsoft.ML.Tokenizers/Model/BPE.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0JQRS5jcw==) | `63.81% <ø> (+1.94%)` | :arrow_up: |
| [...rc/Microsoft.ML.Tokenizers/Model/EnglishRoberta.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0VuZ2xpc2hSb2JlcnRhLmNz) | `79.63% <ø> (ø)` | |
| [src/Microsoft.ML.Tokenizers/Model/Model.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL01vZGVsLmNz) | `38.46% <ø> (ø)` | |
| [...rc/Microsoft.ML.Tokenizers/PreTokenizer/Roberta.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9Sb2JlcnRhLmNz) | `57.14% <100.00%> (ø)` | |
| [test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9CcGVUZXN0cy5jcw==) | `86.93% <100.00%> (+0.06%)` | :arrow_up: |
| [...crosoft.ML.Tokenizers.Tests/EnglishRobertaTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9FbmdsaXNoUm9iZXJ0YVRlc3RzLmNz) | `95.71% <100.00%> (+0.03%)` | :arrow_up: |
| [test/Microsoft.ML.Tokenizers.Tests/TitokenTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9UaXRva2VuVGVzdHMuY3M=) | `100.00% <100.00%> (ø)` | |
| [...st/Microsoft.ML.Tokenizers.Tests/TokenizerTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9Ub2tlbml6ZXJUZXN0cy5jcw==) | `100.00% <100.00%> (ø)` | |
| [src/Microsoft.ML.Tokenizers/Tokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuaXplci5jcw==) | `84.41% <95.00%> (-3.96%)` | :arrow_down: |
| [src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1Rpa3Rva2VuLmNz) | `67.72% <77.84%> (+5.55%)` | :arrow_up: |

... and [9 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7047/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)
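
For readers checking the report by hand, the headline percentages appear to follow from the Hits and Lines figures in the coverage diff. The Python sketch below is only an illustration: `coverage_percent` is a hypothetical helper, not part of Codecov, and the figure of 255 changed lines is an assumption inferred from "38 lines missing" at 85.09804%, since the report does not state the patch size directly.

```python
def coverage_percent(hits: int, total_lines: int) -> float:
    """Return covered lines as a percentage of all tracked lines."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return 100.0 * hits / total_lines


# Hits and Lines copied from the coverage diff in the report.
base = coverage_percent(hits=172197, total_lines=250248)
head = coverage_percent(hits=172304, total_lines=250358)

# ASSUMPTION: 255 changed lines is inferred, not stated in the report.
PATCH_LINES = 255
PATCH_MISSING = 38  # "38 lines ... missing coverage"
patch = coverage_percent(PATCH_LINES - PATCH_MISSING, PATCH_LINES)

print(f"base project coverage: {base:.2f}%")
print(f"head project coverage: {head:.2f}%")
print(f"change:                {head - base:+.2f}%")
print(f"patch coverage:        {patch:.5f}%")
```

Under these assumptions the computed values round to the 68.81%, 68.82%, +0.01% and 85.09804% shown in the report.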
tarekgh commented 3 months ago

@stephentoub I have addressed all the feedback. Please let me know if you have any more. Thanks!