dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
8.92k stars 1.86k forks

Optimize regexes used in tiktoken #7020

Closed by stephentoub 4 months ago

stephentoub commented 4 months ago

This ports the tweaks in https://github.com/openai/tiktoken/pull/234. I noticed the differences as they also show up in the source for https://www.youtube.com/watch?v=zduSFxRajkE.

@tarekgh, if this conflicts with any of your changes, feel free to close this and I can re-make them after your changes land.

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 68.83%. Comparing base (a139371) to head (849097e).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7020      +/-   ##
==========================================
- Coverage   68.83%   68.83%   -0.01%
==========================================
  Files        1258     1258
  Lines      250674   250672       -2
  Branches    25615    25615
==========================================
- Hits       172561   172543      -18
- Misses      71484    71495      +11
- Partials     6629     6634       +5
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7020/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7020/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.83% <100.00%> (-0.01%)` | :arrow_down: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7020/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.27% <100.00%> (-0.01%)` | :arrow_down: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7020/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.56% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7020?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [...rc/Microsoft.ML.Tokenizers/PreTokenizer/Roberta.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7020?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9Sb2JlcnRhLmNz) | `57.14% <100.00%> (-9.53%)` | :arrow_down: |
| [...Microsoft.ML.Tokenizers/PreTokenizer/Whitespace.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7020?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9XaGl0ZXNwYWNlLmNz) | `100.00% <ø> (ø)` | |
| [src/Microsoft.ML.Tokenizers/Tokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7020?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuaXplci5jcw==) | `82.64% <100.00%> (ø)` | |

... and [4 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7020/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)