dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Reduce Tiktoken Creation Memory Allocation #7202

Closed tarekgh closed 4 months ago

tarekgh commented 4 months ago

CC @ericstj

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 71.42857% with 8 lines in your changes missing coverage. Please review.

Project coverage is 68.83%. Comparing base (34eb579) to head (0f27782).

Additional details and impacted files

```diff
@@           Coverage Diff            @@
##             main    #7202    +/-   ##
========================================
  Coverage   68.82%   68.83%
========================================
  Files        1267     1267
  Lines      259804   259825      +21
  Branches    26952    26956       +4
========================================
+ Hits       178818   178842      +24
+ Misses      74105    74100       -5
- Partials     6881     6883       +2
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7202/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7202/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.83% <71.42%> (+<0.01%)` | :arrow_up: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7202/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.08% <46.66%> (+<0.01%)` | :arrow_up: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7202/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.99% <100.00%> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7202?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [test/Microsoft.ML.Tokenizers.Tests/TitokenTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7202?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FTitokenTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9UaXRva2VuVGVzdHMuY3M=) | `98.98% <100.00%> (+0.02%)` | :arrow_up: |
| [...Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7202?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FTiktokenTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1Rpa3Rva2VuVG9rZW5pemVyLmNz) | `77.83% <46.66%> (-0.66%)` | :arrow_down: |

... and [5 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7202/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)
ericstj commented 3 months ago

I see: we compute the final total capacity required up front to reduce allocations. I take it this gives the collection a smaller steady-state size, which is what we're after.

Is there any benefit in further reducing the allocation cost of loading the files - or are we OK with that since it's a local peak that'll get reclaimed by GC?
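The steady-state win described above comes from the standard .NET pattern of passing a known capacity to a collection's constructor, so the backing storage is allocated once at the final size instead of growing (and re-allocating) repeatedly during inserts. A minimal sketch of that pattern, with hypothetical names (not the PR's actual code):

```csharp
using System;
using System.Collections.Generic;

class PresizeDemo
{
    // Hypothetical helper for illustration: build a token-to-id map with the
    // entry count known up front. Passing `capacity` to the Dictionary
    // constructor sizes the internal buckets once, so no intermediate
    // resize allocations occur while loading.
    static Dictionary<string, int> LoadVocab(string[] tokens)
    {
        var vocab = new Dictionary<string, int>(capacity: tokens.Length);
        for (int i = 0; i < tokens.Length; i++)
        {
            vocab[tokens[i]] = i;
        }
        return vocab;
    }

    static void Main()
    {
        var vocab = LoadVocab(new[] { "hello", "world", "!" });
        Console.WriteLine(vocab["world"]); // prints 1
    }
}
```

Without the capacity argument, the dictionary starts small and doubles its backing arrays as it fills; the abandoned intermediate arrays are exactly the kind of transient garbage the GC has to reclaim later.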

tarekgh commented 3 months ago

> Is there any benefit in further reducing the allocation cost of loading the files - or are we OK with that since it's a local peak that'll get reclaimed by GC?

I think this is something we can look into more later to see if we can reduce the allocations further, but I am not seeing this as a pressing issue that we need to address now.