dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
8.92k stars, 1.86k forks

Tokenizer's Interfaces Cleanup #7001

Closed: tarekgh closed this 4 months ago

tarekgh commented 4 months ago

This update encompasses the following:

tarekgh commented 4 months ago

@michaelgsharp I would appreciate it if you could review the changes. I have removed a couple of the APIs you introduced earlier and provided a workaround for their usage. Thank you!

tarekgh commented 4 months ago

CC @luisquintanilla @stephentoub @ericstj @LittleLittleCloud

codecov[bot] commented 4 months ago

Codecov Report

Attention: 104 lines in your changes are missing coverage. Please review.

Comparison is base (64523e8) 68.81% compared to head (7c61933) 68.80%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7001      +/-   ##
==========================================
- Coverage   68.81%   68.80%   -0.02%
==========================================
  Files        1258     1258
  Lines      250477   250652     +175
  Branches    25576    25602      +26
==========================================
+ Hits       172377   172468      +91
- Misses      71473    71553      +80
- Partials     6627     6631       +4
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.80% <65.21%> (-0.02%)` | :arrow_down: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.27% <59.84%> (-0.02%)` | :arrow_down: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.44% <100.00%> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [src/Microsoft.ML.Tokenizers/Model/Word.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1dvcmQuY3M=) | `84.37% <100.00%> (+0.62%)` | :arrow_up: |
| [test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9CcGVUZXN0cy5jcw==) | `100.00% <100.00%> (ø)` | |
| [...crosoft.ML.Tokenizers.Tests/EnglishRobertaTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9FbmdsaXNoUm9iZXJ0YVRlc3RzLmNz) | `100.00% <100.00%> (ø)` | |
| [test/Microsoft.ML.Tokenizers.Tests/TitokenTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9UaXRva2VuVGVzdHMuY3M=) | `100.00% <100.00%> (ø)` | |
| [src/Microsoft.ML.TorchSharp/Roberta/QATrainer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub3JjaFNoYXJwL1JvYmVydGEvUUFUcmFpbmVyLmNz) | `78.37% <66.66%> (+0.07%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Model/Cache.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0NhY2hlLmNz) | `40.98% <50.00%> (-3.64%)` | :arrow_down: |
| [src/Microsoft.ML.Tokenizers/Model/Model.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL01vZGVsLmNz) | `0.00% <0.00%> (-7.70%)` | :arrow_down: |
| [src/Microsoft.ML.Tokenizers/Model/BPE.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0JQRS5jcw==) | `75.00% <81.25%> (+4.31%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Tokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuaXplci5jcw==) | `83.40% <75.00%> (+1.93%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1Rpa3Rva2VuLmNz) | `47.46% <59.09%> (+1.65%)` | :arrow_up: |
| ... and [1 more](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | | |

... and [2 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7001/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)