dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.01k stars 1.88k forks source link

Add Span support in tokenizer's Model abstraction #7035

Closed tarekgh closed 7 months ago

tarekgh commented 7 months ago

This change is adding Span support to the Tokenizer's Model abstraction. This will help in performance and not restrict us in the future to use spans without to worry about more memory allocations.

tarekgh commented 7 months ago

CC @ericstj @luisquintanilla @michaelgsharp

codecov[bot] commented 7 months ago

Codecov Report

Attention: Patch coverage is 68.89764% with 79 lines in your changes are missing coverage. Please review.

Project coverage is 68.80%. Comparing base (d0aa2c2) to head (40a669f). Report is 3 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #7035 +/- ## ========================================== + Coverage 68.79% 68.80% +0.01% ========================================== Files 1254 1255 +1 Lines 250204 250248 +44 Branches 25529 25533 +4 ========================================== + Hits 172127 172190 +63 + Misses 71467 71449 -18 + Partials 6610 6609 -1 ``` | [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | | |---|---|---| | [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.80% <68.89%> (+0.01%)` | :arrow_up: | | [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.24% <68.65%> (+0.01%)` | :arrow_up: | | [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.50% <100.00%> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | | |---|---|---| | [src/Microsoft.ML.Tokenizers/Model/Cache.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0NhY2hlLmNz) | `96.66% <100.00%> (+48.72%)` | :arrow_up: | | [src/Microsoft.ML.Tokenizers/Tokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuaXplci5jcw==) | `88.37% <100.00%> (ø)` | | | [...crosoft.ML.Tokenizers/Utils/Helpers.netstandard.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL0hlbHBlcnMubmV0c3RhbmRhcmQuY3M=) | `81.25% <100.00%> (+6.25%)` | :arrow_up: | | [test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9CcGVUZXN0cy5jcw==) | `86.86% <100.00%> (ø)` | | | [...crosoft.ML.Tokenizers.Tests/EnglishRobertaTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9FbmdsaXNoUm9iZXJ0YVRlc3RzLmNz) | `95.68% <100.00%> (ø)` | | | [src/Microsoft.ML.Tokenizers/Model/Model.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL01vZGVsLmNz) | `38.46% <0.00%> (ø)` | | | [...rc/Microsoft.ML.Tokenizers/Model/EnglishRoberta.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0VuZ2xpc2hSb2JlcnRhLmNz) | `79.63% <93.47%> (+0.95%)` | :arrow_up: | | [...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL1N0cmluZ1NwYW5PcmRpbmFsS2V5LmNz) | `82.14% <82.14%> (ø)` | | | [src/Microsoft.ML.Tokenizers/Utils/LruCache.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL0xydUNhY2hlLmNz) | `73.68% <59.45%> (+7.01%)` | :arrow_up: | | [src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1Rpa3Rva2VuLmNz) | `62.17% <51.28%> (+0.80%)` | :arrow_up: | | ... and [1 more](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | | ... and [3 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7035/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)