dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.04k stars 1.88k forks source link

Make most Tokenizer abstract methods virtual #7198

Closed stephentoub closed 3 months ago

stephentoub commented 4 months ago

All of the functionality in all of the methods can be implemented in terms of just a single Decode and EncodeToTokens set of methods. Only those two need to be abstract; everything else that was abstract can instead be virtual and implemented in terms of those.

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 90.54054% with 14 lines in your changes missing coverage. Please review.

Project coverage is 68.81%. Comparing base (0c2e82e) to head (4e2f9ed).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #7198 +/- ## ========================================== - Coverage 69.07% 68.81% -0.26% ========================================== Files 1401 1267 -134 Lines 268516 259641 -8875 Branches 27812 26927 -885 ========================================== - Hits 185466 178673 -6793 + Misses 75890 74090 -1800 + Partials 7160 6878 -282 ``` | [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | | |---|---|---| | [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.81% <90.54%> (-0.26%)` | :arrow_down: | | [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.08% <82.27%> (-0.58%)` | :arrow_down: | | [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.97% <100.00%> (+0.01%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | | |---|---|---| | [src/Common/tests/RetryHelper.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198?src=pr&el=tree&filepath=src%2FCommon%2Ftests%2FRetryHelper.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL0NvbW1vbi90ZXN0cy9SZXRyeUhlbHBlci5jcw==) | `18.96% <ø> (ø)` | | | [src/Microsoft.ML.Tokenizers/EncodeSettings.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FEncodeSettings.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL0VuY29kZVNldHRpbmdzLmNz) | `100.00% <ø> (ø)` | | | [src/Microsoft.ML.Tokenizers/EncodedToken.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FEncodedToken.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL0VuY29kZWRUb2tlbi5jcw==) | `100.00% <ø> (ø)` | | | [src/Microsoft.ML.Tokenizers/Model/Phi2Tokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FPhi2Tokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1BoaTJUb2tlbml6ZXIuY3M=) | `0.00% <ø> (ø)` | | | [...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FUtils%2FStringSpanOrdinalKey.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL1N0cmluZ1NwYW5PcmRpbmFsS2V5LmNz) | `76.59% <ø> (ø)` | | | [...icrosoft.ML.Tokenizers/Utils/ValueStringBuilder.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FUtils%2FValueStringBuilder.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL1ZhbHVlU3RyaW5nQnVpbGRlci5jcw==) | `47.01% <ø> (ø)` | | | [test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FBpeTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9CcGVUZXN0cy5jcw==) | `100.00% <100.00%> (ø)` | | | [...st/Microsoft.ML.Tokenizers.Tests/TokenizerTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FTokenizerTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9Ub2tlbml6ZXJUZXN0cy5jcw==) | `99.29% <100.00%> (+0.62%)` | :arrow_up: | | [src/Microsoft.ML.Tokenizers/Tokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuaXplci5jcw==) | `84.04% <82.27%> (-2.05%)` | :arrow_down: | ... and [141 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7198/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)