dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.05k stars 1.88k forks

Introducing WordPiece and Bert tokenizers #7275

Closed tarekgh closed 1 month ago

tarekgh commented 1 month ago

CC @luisquintanilla @ericstj

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 84.06633% with 221 lines in your changes missing coverage. Please review.

Project coverage is 68.89%. Comparing base (f385b06) to head (e78d834). Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...icrosoft.ML.Tokenizers/Model/WordPieceTokenizer.cs | 75.05% | 82 Missing and 28 partials :warning: |
| src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs | 74.19% | 58 Missing and 14 partials :warning: |
| ...crosoft.ML.Tokenizers/Normalizer/BertNormalizer.cs | 66.66% | 26 Missing and 9 partials :warning: |
| ...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs | 78.57% | 2 Missing and 1 partial :warning: |
| src/Microsoft.ML.Tokenizers/EncodedToken.cs | 0.00% | 0 Missing and 1 partial :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7275      +/-   ##
==========================================
+ Coverage   68.80%   68.89%   +0.08%
==========================================
  Files        1461     1466       +5
  Lines      272400   273778    +1378
  Branches    28176    28349     +173
==========================================
+ Hits       187436   188606    +1170
- Misses      77729    77887     +158
- Partials     7235     7285      +50
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.89% <84.06%> (+0.08%)` | :arrow_up: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.34% <73.78%> (+0.04%)` | :arrow_up: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `89.17% <100.00%> (+0.10%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files with missing lines](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FBPETokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0JQRVRva2VuaXplci5jcw==) | `76.75% <100.00%> (ø)` | |
| [...icrosoft.ML.Tokenizers.Tests/BertTokenizerTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FBertTokenizerTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9CZXJ0VG9rZW5pemVyVGVzdHMuY3M=) | `100.00% <100.00%> (ø)` | |
| [test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FBpeTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9CcGVUZXN0cy5jcw==) | `100.00% <100.00%> (ø)` | |
| [...Microsoft.ML.Tokenizers.Tests/PreTokenizerTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FPreTokenizerTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9QcmVUb2tlbml6ZXJUZXN0cy5jcw==) | `92.00% <100.00%> (+0.69%)` | :arrow_up: |
| [...st/Microsoft.ML.Tokenizers.Tests/WordPieceTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FWordPieceTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9Xb3JkUGllY2VUZXN0cy5jcw==) | `100.00% <100.00%> (ø)` | |
| [src/Microsoft.ML.Tokenizers/EncodedToken.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FEncodedToken.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL0VuY29kZWRUb2tlbi5jcw==) | `88.88% <0.00%> (-11.12%)` | :arrow_down: |
| [...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FPreTokenizer%2FPreTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9QcmVUb2tlbml6ZXIuY3M=) | `92.68% <78.57%> (-7.32%)` | :arrow_down: |
| [...crosoft.ML.Tokenizers/Normalizer/BertNormalizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FNormalizer%2FBertNormalizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL05vcm1hbGl6ZXIvQmVydE5vcm1hbGl6ZXIuY3M=) | `66.66% <66.66%> (ø)` | |
| [src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FBertTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0JlcnRUb2tlbml6ZXIuY3M=) | `74.19% <74.19%> (ø)` | |
| [...icrosoft.ML.Tokenizers/Model/WordPieceTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FWordPieceTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1dvcmRQaWVjZVRva2VuaXplci5jcw==) | `75.05% <75.05%> (ø)` | |

... and [6 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7275/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)

tarekgh commented 1 month ago

/ba-g the failures are related to libomp, which is a known issue, and @michaelgsharp is currently working on fixing them.

tarekgh commented 4 weeks ago

@stephentoub I'll address the feedback in another PR. Thanks!