Prototype using spans in Model #7018

stephentoub commented 4 months ago

@tarekgh, this isn't for merging, but it shows approximately what I had in mind for incorporating spans into Model (I know you're currently revising the surface area, so take this with a grain of salt). It eliminates most of the remaining allocation that occurs when using Tokenizer.CountTokens/EncodeToIds, because it avoids allocating a string for each token that's already in the cache.

Feel free to crib liberally from the second commit and close this PR. Ignore the first commit, which I submitted separately.
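For readers skimming the thread, here is a minimal sketch of the general idea: looking up a cached token directly from a `ReadOnlySpan<char>` slice of the input, so that no string is allocated on a cache hit. It is not the code from this PR. The file list suggests the PR uses its own `StringSpanOrdinalKey` type, whereas this sketch assumes .NET 9 or later and uses `Dictionary<TKey,TValue>.GetAlternateLookup` instead. The `SpanCacheSketch` class, its contents, and the whitespace splitting are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: a hypothetical token cache, not the PR's Cache or Model types.
public static class SpanCacheSketch
{
    // Token text -> token id. StringComparer.Ordinal supports span-based
    // alternate lookups on .NET 9+ (assumption: that runtime is available).
    private static readonly Dictionary<string, int> s_cache = new(StringComparer.Ordinal)
    {
        ["hello"] = 1,
        ["world"] = 2,
    };

    // Counts the space-separated tokens of `text` that are already cached.
    // Each token is a slice of the input, so a cache hit allocates no string.
    public static int CountCachedTokens(ReadOnlySpan<char> text)
    {
        var lookup = s_cache.GetAlternateLookup<ReadOnlySpan<char>>();
        int count = 0;

        while (!text.IsEmpty)
        {
            int space = text.IndexOf(' ');
            ReadOnlySpan<char> token = space < 0 ? text : text.Slice(0, space);
            text = space < 0 ? ReadOnlySpan<char>.Empty : text.Slice(space + 1);

            if (!token.IsEmpty && lookup.TryGetValue(token, out _))
            {
                count++;
            }
        }

        return count;
    }
}
```

Calling `SpanCacheSketch.CountCachedTokens("hello world foo")` walks the three slices and probes the dictionary with each one; only a cache miss would need a `ToString()` call, and only if the token were then added to the cache.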

codecov[bot] commented 4 months ago

Codecov Report

Attention: 69 lines in your changes are missing coverage. Please review.

Comparison is base (f976424) 68.81% compared to head (e78ab0f) 68.81%. Report is 6 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #7018   +/-   ##
=======================================
  Coverage   68.81%   68.81%
=======================================
  Files        1258     1259    +1
  Lines      250643   250665   +22
  Branches    25606    25608    +2
=======================================
+ Hits       172479   172501   +22
+ Misses      71540    71534    -6
- Partials     6624     6630    +6
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.81% <62.50%> (+<0.01%)` | :arrow_up: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.28% <62.50%> (+<0.01%)` | :arrow_up: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.44% <ø> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [...rc/Microsoft.ML.Tokenizers/Model/EnglishRoberta.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0VuZ2xpc2hSb2JlcnRhLmNz) | `67.36% <100.00%> (ø)` | |
| [...crosoft.ML.Tokenizers/Utils/Helpers.netstandard.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL0hlbHBlcnMubmV0c3RhbmRhcmQuY3M=) | `75.00% <100.00%> (+15.00%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Model/BPE.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0JQRS5jcw==) | `75.29% <75.00%> (+0.29%)` | :arrow_up: |
| [...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL1N0cmluZ1NwYW5PcmRpbmFsS2V5LmNz) | `94.44% <94.44%> (ø)` | |
| [src/Microsoft.ML.Tokenizers/Model/Model.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL01vZGVsLmNz) | `10.00% <50.00%> (+10.00%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Model/Cache.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0NhY2hlLmNz) | `75.00% <76.92%> (+34.01%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Tokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuaXplci5jcw==) | `82.64% <52.17%> (-0.97%)` | :arrow_down: |
| [src/Microsoft.ML.Tokenizers/Utils/LruCache.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL0xydUNhY2hlLmNz) | `77.77% <64.70%> (+11.11%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1Rpa3Rva2VuLmNz) | `54.92% <48.00%> (-0.64%)` | :arrow_down: |

... and [6 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7018/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)

tarekgh commented 4 months ago

Closing this in favor of the following: https://github.com/dotnet/machinelearning/pull/7035
