dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

First round of perf improvements for tiktoken #7012

Closed: stephentoub closed this 4 months ago

stephentoub commented 4 months ago
Before:

| Method | Mean | Allocated |
|---|---|---|
| CountTokensCached | 3.677 s | 4.82 GB |
| CountTokensUncached | 2.309 s | 3.03 GB |

After:

| Method | Mean | Allocated |
|---|---|---|
| CountTokensCached | 2.545 s | 637.63 MB |
| CountTokensUncached | 1.627 s | 408.34 MB |

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.ML.Tokenizers;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private Tokenizer _tokenizer;
    private string[] _tests;

    [GlobalSetup]
    public async Task Setup()
    {
        _tokenizer = await Tokenizer.CreateByModelNameAsync("gpt-3.5-turbo");
        using HttpClient client = new HttpClient();
        string text = string.Concat(Enumerable.Repeat(Poem, 8));
        _tests = new string[8192]; // LruCache size
        for (int i = 0; i < _tests.Length; i++)
        {
            _tests[i] = text.Substring(0, text.Length - i);
        }
    }

    [Benchmark]
    public int CountTokensCached()
    {
        int sum = 0;
        for (int i = 0; i < _tests.Length; i++)
        {
            sum += _tokenizer.CountTokens(_tests[0]); // reuse same input each time
        }
        return sum;
    }

    [Benchmark]
    public int CountTokensUncached()
    {
        int sum = 0;
        for (int i = 0; i < _tests.Length; i++)
        {
            sum += _tokenizer.CountTokens(_tests[i]); // change the input to defeat the cache
        }
        return sum;
    }

    private const string Poem = """
        **Paws of Joy**

        In the morning's tender light,
        When dew-kissed grass awaits the sun,
        There stirs a creature, full of might,
        A friend whose loyalty is never undone.

        **The Dog**, with eyes like galaxies,
        Wags its tail, a metronome of glee,
        Its heart a map of boundless territories,
        Guiding us through life's vast sea.

        **Furry sentinels**, guardians of our hearth,
        They chase their tails in playful mirth,
        Their barks a symphony of love and merriment,
        Echoing through the quiet moments we've spent.

        **Nose to ground**, they follow scents,
        Unraveling mysteries with fervent intent,
        From squirrel trails to forgotten dreams,
        They lead us to places we've never seen.

        **Golden retrievers** with hearts of gold,
        **Dachshunds** with determination untold,
        **Greyhounds** racing against the wind,
        Each breed a chapter in the story they've pinned.

        **Labradors** dive into lakes with glee,
        **Chihuahuas** strut like tiny royalty,
        **Huskies** howl at the moon's silver glow,
        And **puppies**, oh sweet puppies, steal the show.

        Their eyes speak of trust, unwavering and true,
        Their fur holds secrets whispered by the dew,
        In their presence, worries seem to fade,
        As they teach us the art of living unafraid.

        So here's to the dogs, our steadfast friends,
        Who mend our hearts and heal life's bends,
        May their tails forever wag, their noses explore,
        For in their love, we find solace evermore.
        """;
}
```

stephentoub commented 4 months ago

cc: @tarekgh

codecov[bot] commented 4 months ago

Codecov Report

Attention: 67 lines in your changes are missing coverage. Please review.

Comparison is base (4635a86) 68.80% compared to head (ed88215) 68.81%.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #7012   +/-   ##
=======================================
  Coverage   68.80%   68.81%
=======================================
  Files        1258     1258
  Lines      250652   250643    -9
  Branches    25602    25606    +4
=======================================
  Hits       172472   172472
+ Misses      71548    71546    -2
+ Partials     6632     6625    -7
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.81% <70.08%> (+<0.01%)` | :arrow_up: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.28% <70.08%> (+<0.01%)` | :arrow_up: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.44% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [...Microsoft.ML.Tokenizers/PreTokenizer/Whitespace.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9XaGl0ZXNwYWNlLmNz) | `100.00% <100.00%> (ø)` | |
| [src/Microsoft.ML.Tokenizers/TokenizerResult.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuaXplclJlc3VsdC5jcw==) | `100.00% <100.00%> (ø)` | |
| [...Microsoft.ML.Tokenizers/Utils/ByteArrayComparer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL0J5dGVBcnJheUNvbXBhcmVyLmNz) | `100.00% <100.00%> (+35.29%)` | :arrow_up: |
| [...rc/Microsoft.ML.Tokenizers/Model/EnglishRoberta.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0VuZ2xpc2hSb2JlcnRhLmNz) | `67.36% <85.71%> (ø)` | |
| [...rc/Microsoft.ML.Tokenizers/PreTokenizer/Roberta.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9Sb2JlcnRhLmNz) | `66.66% <80.00%> (+9.52%)` | :arrow_up: |
| [...ML.Tokenizers/PreTokenizer/TikTokenPreTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9UaWtUb2tlblByZVRva2VuaXplci5jcw==) | `90.24% <94.73%> (+12.58%)` | :arrow_up: |
| [...c/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL0J5dGVQYWlyRW5jb2Rlci5jcw==) | `94.82% <75.00%> (-0.42%)` | :arrow_down: |
| [...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9QcmVUb2tlbml6ZXIuY3M=) | `94.44% <90.00%> (+8.73%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Tokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuaXplci5jcw==) | `83.61% <88.88%> (+0.20%)` | :arrow_up: |
| [...c/Microsoft.ML.Tokenizers/Utils/IListExtensions.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL0lMaXN0RXh0ZW5zaW9ucy5jcw==) | `25.00% <0.00%> (-16.67%)` | :arrow_down: |
| ... and [1 more](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | | |

... and [4 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7012/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)

tarekgh commented 4 months ago

Would this be better handled in the csproj? We could include this file only when targeting net8.0.


Refers to: src/Microsoft.ML.Tokenizers/AssemblyInfo.cs:8 in ed88215.
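
For illustration, the csproj approach suggested above could look roughly like the fragment below. This is a minimal sketch, not taken from the PR: it assumes an SDK-style project that relies on the default compile globbing, that `AssemblyInfo.cs` sits next to the project file, and that `net8.0` is one of the project's target frameworks.

```xml
<!-- Hypothetical fragment for the Microsoft.ML.Tokenizers project file; an assumption, not the PR's actual change. -->
<ItemGroup Condition="'$(TargetFramework)' != 'net8.0'">
  <!-- Drop the file from the default Compile items for every target except net8.0. -->
  <Compile Remove="AssemblyInfo.cs" />
</ItemGroup>
```

The alternative is to keep the file in every build and guard its contents with `#if NET8_0_OR_GREATER`. The csproj condition keeps the source file free of preprocessor directives, at the cost of moving the decision out of the file it affects.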