dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Address the feedback on the tokenizer's library #7024

Closed: tarekgh closed this 4 months ago

tarekgh commented 4 months ago

This fix addresses the feedback reported in the issues:

tarekgh commented 4 months ago

CC @ericstj @michaelgsharp @luisquintanilla @LittleLittleCloud

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 76.65953%, with 109 lines in your changes missing coverage. Please review.

Project coverage is 68.79%. Comparing base (4b89d98) to head (1ad157f).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7024      +/-   ##
==========================================
- Coverage   68.83%   68.79%   -0.04%
==========================================
  Files        1258     1254       -4
  Lines      250672   250204     -468
  Branches    25615    25529      -86
==========================================
- Hits       172547   172125     -422
+ Misses      71493    71468      -25
+ Partials     6632     6611      -21
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.79% <76.65%> (-0.04%)` | :arrow_down: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.22% <66.24%> (-0.05%)` | :arrow_down: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.50% <98.66%> (-0.07%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [src/Microsoft.ML.Tokenizers/EncodingResult.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL0VuY29kaW5nUmVzdWx0LmNz) | `98.41% <100.00%> (ø)` | |
| [src/Microsoft.ML.Tokenizers/Model/Word.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1dvcmQuY3M=) | `58.75% <100.00%> (-25.63%)` | :arrow_down: |
| [...ft.ML.Tokenizers/Normalizer/LowerCaseNormalizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL05vcm1hbGl6ZXIvTG93ZXJDYXNlTm9ybWFsaXplci5jcw==) | `100.00% <100.00%> (ø)` | |
| [...ft.ML.Tokenizers/Normalizer/UpperCaseNormalizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL05vcm1hbGl6ZXIvVXBwZXJDYXNlTm9ybWFsaXplci5jcw==) | `100.00% <100.00%> (ø)` | |
| [...ML.Tokenizers/PreTokenizer/TikTokenPreTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9UaWtUb2tlblByZVRva2VuaXplci5jcw==) | `90.24% <100.00%> (ø)` | |
| [...Microsoft.ML.Tokenizers/PreTokenizer/Whitespace.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9XaGl0ZXNwYWNlLmNz) | `100.00% <100.00%> (ø)` | |
| [src/Microsoft.ML.Tokenizers/Token.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuLmNz) | `100.00% <100.00%> (ø)` | |
| [...c/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL0J5dGVQYWlyRW5jb2Rlci5jcw==) | `88.23% <ø> (ø)` | |
| [...ft.ML.TorchSharp/Extensions/TokenizerExtensions.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub3JjaFNoYXJwL0V4dGVuc2lvbnMvVG9rZW5pemVyRXh0ZW5zaW9ucy5jcw==) | `87.50% <100.00%> (ø)` | |
| [src/Microsoft.ML.TorchSharp/NasBert/NerTrainer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub3JjaFNoYXJwL05hc0JlcnQvTmVyVHJhaW5lci5jcw==) | `91.10% <100.00%> (ø)` | |
| ... and [14 more](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | | |

... and [7 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7024/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)
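
For anyone checking the numbers: the project coverage percentages in the report appear to follow from the Hits, Misses, and Partials counts, with Lines equal to their sum. The sketch below is not Codecov's code; the function name `coverage_percent` is made up for illustration, and treating partial lines as not covered is an assumption that happens to reproduce the reported figures.

```python
def coverage_percent(hits: int, misses: int, partials: int) -> float:
    """Return line coverage as a percentage.

    Assumption: partially covered lines count toward the total
    but not toward the covered lines.
    """
    total = hits + misses + partials
    return 100.0 * hits / total


# Counts copied from the Coverage Diff in the report.
base = coverage_percent(hits=172547, misses=71493, partials=6632)  # main
head = coverage_percent(hits=172125, misses=71468, partials=6611)  # #7024

print(f"base {base:.2f}%  head {head:.2f}%  delta {head - base:+.2f}%")
```

Rounded to two decimals, this gives 68.83% for the base, 68.79% for the head, and a change of -0.04%, matching the Coverage row.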