dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Expose Encoder and Decoder in TiktokenTokenizer #7314

Open razshare opened 6 days ago

razshare commented 6 days ago

Fixes #7313


codecov[bot] commented 6 days ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 68.88%. Comparing base (5090327) to head (140aa42). Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7314      +/-   ##
==========================================
+ Coverage   68.87%   68.88%   +0.01%
==========================================
  Files        1470     1470
  Lines      274005   274005
  Branches    28403    28401       -2
==========================================
+ Hits       188717   188754      +37
+ Misses      77970    77936      -34
+ Partials     7318     7315       -3
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7314/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7314/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.88% <100.00%> (+0.01%)` | :arrow_up: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7314/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.30% <100.00%> (+0.01%)` | :arrow_up: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7314/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `89.41% <100.00%> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files with missing lines](https://app.codecov.io/gh/dotnet/machinelearning/pull/7314?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [...Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7314?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FTiktokenTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1Rpa3Rva2VuVG9rZW5pemVyLmNz) | `78.28% <100.00%> (ø)` | |
| [...est/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7314?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FTiktokenTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9UaWt0b2tlblRlc3RzLmNz) | `99.39% <100.00%> (+0.40%)` | :arrow_up: |

... and [8 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7314/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)


tarekgh commented 3 days ago

@razshare I replied on the issue. Let's discuss it there first before we continue here. Thanks a lot for your submission. I have converted this PR to a draft for now, until we finish the discussion.

razshare commented 2 days ago

@dotnet-policy-service agree