dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Tweak Tiktoken's BytePairEncode for improved perf #7017

Status: Closed (stephentoub closed this 4 months ago)

stephentoub commented 4 months ago
[Benchmark]
public int CountTokens() => _tokenizer.CountTokens(Poem);

This uses the same Poem as in https://github.com/dotnet/machinelearning/pull/7012, with the LruCache size set to 0 so that the cache is skipped and the benchmark measures only what's being changed here.

Before:

| Method | Mean | Allocated |
|---|---|---|
| CountTokens | 61.11 us | 19.52 KB |

After:

| Method | Mean | Allocated |
|---|---|---|
| CountTokens | 58.82 us | 11.27 KB |

cc: @tarekgh

codecov[bot] commented 4 months ago

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (f976424) 68.81% compared to head (b50995c) 68.81%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7017      +/-   ##
==========================================
- Coverage   68.81%   68.81%   -0.01%
==========================================
  Files        1258     1258
  Lines      250643   250653      +10
  Branches    25606    25608       +2
==========================================
+ Hits       172479   172480       +1
- Misses      71540    71546       +6
- Partials     6624     6627       +3
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7017/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7017/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.81% <80.76%> (-0.01%)` | :arrow_down: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7017/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.28% <80.76%> (-0.01%)` | :arrow_down: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7017/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.44% <ø> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7017?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [...c/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7017?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL0J5dGVQYWlyRW5jb2Rlci5jcw==) | `88.23% <80.76%> (-6.60%)` | :arrow_down: |

... and [3 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7017/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)