dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Backport tokenizer changes to Release/4.0 #7292

Closed: tarekgh closed this 1 week ago

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 79.61477% with 127 lines in your changes missing coverage. Please review.

Project coverage is 68.87%. Comparing base (a9b4212) to head (0613779). Report is 1 commit behind head on release/4.0.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs | 57.30% | 48 Missing and 28 partials :warning: |
| .../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs | 67.24% | 10 Missing and 9 partials :warning: |
| ...icrosoft.ML.Tokenizers/Model/WordPieceTokenizer.cs | 60.00% | 7 Missing and 5 partials :warning: |
| ...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | 85.71% | 4 Missing and 1 partial :warning: |
| src/Microsoft.ML.Tokenizers/Model/Phi2Tokenizer.cs | 0.00% | 4 Missing :warning: |
| src/Microsoft.ML.Tokenizers/Tokenizer.cs | 75.00% | 4 Missing :warning: |
| src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs | 89.65% | 2 Missing and 1 partial :warning: |
| ...oft.ML.Tokenizers/Model/EnglishRobertaTokenizer.cs | 90.90% | 1 Missing :warning: |
| ...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs | 88.88% | 0 Missing and 1 partial :warning: |
| test/Microsoft.ML.Tokenizers.Tests/CodeGenTests.cs | 98.93% | 0 Missing and 1 partial :warning: |

... and 1 more
Additional details and impacted files

```diff
@@             Coverage Diff              @@
##           release/4.0    #7292      +/-   ##
===============================================
- Coverage        68.87%   68.87%   -0.01%
===============================================
  Files             1467     1469       +2
  Lines           273955   273989      +34
  Branches         28380    28389       +9
===============================================
+ Hits            188697   188710      +13
- Misses           77946    77972      +26
+ Partials          7312     7307       -5
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.87% <79.61%> (-0.01%)` | :arrow_down: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.33% <69.66%> (-0.01%)` | :arrow_down: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `89.18% <99.05%> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Files with missing lines](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [src/Microsoft.ML.Tokenizers/Model/BertOptions.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FBertOptions.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0JlcnRPcHRpb25zLmNz) | `100.00% <100.00%> (ø)` | |
| [...rc/Microsoft.ML.Tokenizers/Model/LlamaTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FLlamaTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0xsYW1hVG9rZW5pemVyLmNz) | `59.09% <ø> (ø)` | |
| [...Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FTiktokenTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1Rpa3Rva2VuVG9rZW5pemVyLmNz) | `78.28% <100.00%> (ø)` | |
| [.../Microsoft.ML.Tokenizers/Model/WordPieceOptions.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FWordPieceOptions.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1dvcmRQaWVjZU9wdGlvbnMuY3M=) | `100.00% <100.00%> (ø)` | |
| [...crosoft.ML.Tokenizers/Normalizer/BertNormalizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FNormalizer%2FBertNormalizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL05vcm1hbGl6ZXIvQmVydE5vcm1hbGl6ZXIuY3M=) | `62.85% <100.00%> (ø)` | |
| [...ft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FPreTokenizer%2FRegexPreTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9SZWdleFByZVRva2VuaXplci5jcw==) | `87.23% <100.00%> (ø)` | |
| [src/Microsoft.ML.TorchSharp/NasBert/NerTrainer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292?src=pr&el=tree&filepath=src%2FMicrosoft.ML.TorchSharp%2FNasBert%2FNerTrainer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub3JjaFNoYXJwL05hc0JlcnQvTmVyVHJhaW5lci5jcw==) | `91.10% <100.00%> (ø)` | |
| [...oft.ML.Tokenizers.Data.Tests/TokenizerDataTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Data.Tests%2FTokenizerDataTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5EYXRhLlRlc3RzL1Rva2VuaXplckRhdGFUZXN0cy5jcw==) | `100.00% <ø> (ø)` | |
| [...icrosoft.ML.Tokenizers.Tests/BertTokenizerTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FBertTokenizerTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9CZXJ0VG9rZW5pemVyVGVzdHMuY3M=) | `100.00% <100.00%> (ø)` | |
| [test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FBpeTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9CcGVUZXN0cy5jcw==) | `100.00% <100.00%> (ø)` | |
| ... and [17 more](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | |

... and [11 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7292/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)