dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Embed Tiktoken data files #7098

Closed tarekgh closed 3 months ago

tarekgh commented 3 months ago

Fixes https://github.com/dotnet/machinelearning/issues/7095

This change embeds the Tiktoken tokenizer data files to avoid downloading them at runtime. The files are embedded in compressed form to reduce their size, and the data is decompressed at runtime.

| File | Compressed Size |
|---|---|
| cl100k_base.tiktoken.zip | 784541 bytes |
| gpt2.tiktoken.zip | 370795 bytes |
| p50k_base.tiktoken.zip | 370930 bytes |
| r50k_base.tiktoken.zip | 370795 bytes |

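
The compress-at-build, decompress-at-runtime idea described above can be sketched in a few lines. This is only an illustration in Python, not the code from this PR (which targets .NET); the function names `compress_vocab` and `decompress_vocab` are hypothetical, and raw DEFLATE is an assumed choice of format.

```python
import zlib


def compress_vocab(raw: bytes) -> bytes:
    """Compress vocabulary bytes as a raw DEFLATE stream (no zlib/gzip header)."""
    compressor = zlib.compressobj(level=9, wbits=-zlib.MAX_WBITS)
    return compressor.compress(raw) + compressor.flush()


def decompress_vocab(blob: bytes) -> bytes:
    """Inverse of compress_vocab; would run once when the tokenizer is created."""
    return zlib.decompress(blob, wbits=-zlib.MAX_WBITS)


# Three sample rows in the "<base64 token> <id>" layout used by tiktoken files.
sample = b"IGdhemVk 50255\nICA= 50257\nICAg 50258\n"
blob = compress_vocab(sample)        # what would be embedded in the assembly
restored = decompress_vocab(blob)    # what the tokenizer would parse at startup
print(restored == sample)
```

The trade-off is a one-time decompression cost at startup in exchange for a smaller package and no network access.
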
tarekgh commented 3 months ago

CC @luisquintanilla @michaelgsharp

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 61.93548%, with 59 lines in your changes missing coverage. Please review.

Project coverage is 68.44%. Comparing base (c980eaf) to head (d73f957).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7098      +/-   ##
==========================================
- Coverage   68.45%   68.44%   -0.01%
==========================================
  Files        1262     1263       +1
  Lines      254775   254834      +59
  Branches    26320    26334      +14
==========================================
+ Hits       174404   174422      +18
- Misses      73666    73701      +35
- Partials     6705     6711       +6
```

| Flag | Coverage Δ | |
|---|---|---|
| Debug | `68.44% <61.93%> (-0.01%)` | :arrow_down: |
| production | `62.83% <46.15%> (-0.02%)` | :arrow_down: |
| test | `88.57% <94.11%> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ | |
|---|---|---|
| src/Microsoft.ML.Tokenizers/Tokenizer.cs | `78.14% <ø> (+0.93%)` | :arrow_up: |
| test/Microsoft.ML.Tokenizers.Tests/TitokenTests.cs | `100.00% <100.00%> (ø)` | |
| test/Microsoft.ML.Tokenizers.Tests/Utils.cs | `70.00% <80.00%> (+4.28%)` | :arrow_up: |
| src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs | `69.29% <80.43%> (+0.91%)` | :arrow_up: |
| src/Common/tests/RetryHelper.cs | `18.96% <18.96%> (ø)` | |

... and 6 files with indirect coverage changes

tarekgh commented 3 months ago

> I had a look at the format of these files - every row lists the token id which seems to be equivalent to line number (0-index). I bet we could further reduce the size of these by omitting the ID and assuming line number when absent.

I found a case in the file p50k_base.tiktoken that breaks this rule. I couldn't find a similar case in the rest of the files. I can apply the optimization to the other files and see how much it will save.

IGdhemVk 50255
ICA= 50257
ICAg 50258
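
The rows above jump from 50255 to 50257, so id 50256 has no row and the "id equals line number" rule fails from that point on. A check like this can be automated; the following is a minimal Python sketch, with `find_id_gaps` as a hypothetical helper name and the assumption that each row is `<base64 token> <id>` separated by whitespace.

```python
def find_id_gaps(lines, first_id=None):
    """Yield (expected_id, actual_id) for every row whose id is not the next one.

    Pass first_id=0 when scanning a whole file, so the id must match the
    0-based line number. With first_id=None the first row sets the baseline,
    which is convenient for excerpts.
    """
    expected = first_id
    for line in lines:
        _token_b64, id_text = line.split()
        actual = int(id_text)
        if expected is not None and actual != expected:
            yield expected, actual
        expected = actual + 1


rows = ["IGdhemVk 50255", "ICA= 50257", "ICAg 50258"]
gaps = list(find_id_gaps(rows))
print(gaps)  # [(50256, 50257)]
```
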
tarekgh commented 3 months ago

The new sizes are:

| File | Compressed Size |
|---|---|
| cl100k_base.tiktoken.deflate | 513633 bytes |
| gpt2.tiktoken.deflate | 237449 bytes |
| p50k_base.tiktoken.deflate | 370930 bytes |
| r50k_base.tiktoken.deflate | 237449 bytes |

This saves around 0.5 MB more after getting rid of the IDs.
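
The 0.5 MB figure can be checked against the two sets of sizes posted in this thread. The snippet below is plain arithmetic on those numbers; the dictionary names are only illustrative.

```python
# Compressed sizes in bytes, copied from the two size listings in this thread.
with_ids = {"cl100k_base": 784541, "gpt2": 370795, "p50k_base": 370930, "r50k_base": 370795}
without_ids = {"cl100k_base": 513633, "gpt2": 237449, "p50k_base": 370930, "r50k_base": 237449}

total_before = sum(with_ids.values())
total_after = sum(without_ids.values())
saved = total_before - total_after

print(total_before)  # 1897061
print(total_after)   # 1359461
print(saved)         # 537600
```

p50k_base contributes nothing to the saving at this stage because its IDs were not removed yet.
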

ericstj commented 3 months ago

> IGdhemVk 50255
> ICA= 50257
> ICAg 50258

For this could we insert an extra empty line (or a comment line) to make it work?
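
One way to picture this suggestion: drop the id column and emit an empty line for every skipped id, so the 0-based line number still equals the token id when the file is read back. The sketch below is in Python with hypothetical helper names (`strip_ids`, `restore_ids`) and small made-up rows; it is not the implementation from this PR.

```python
def strip_ids(lines):
    """Drop the id column, inserting an empty line for every id that is skipped."""
    out = []
    for line in lines:
        token_b64, id_text = line.split()
        token_id = int(id_text)
        if token_id < len(out):
            raise ValueError(f"ids must be strictly increasing; got {token_id}")
        out.extend([""] * (token_id - len(out)))  # placeholders keep line numbers aligned
        out.append(token_b64)
    return out


def restore_ids(stripped):
    """Rebuild the token -> id map, using the 0-based line number as the id."""
    return {token: line_no for line_no, token in enumerate(stripped) if token}


# Hypothetical rows with a gap at id 2 (base64 of "a", "b", "d").
rows = ["YQ== 0", "Yg== 1", "ZA== 3"]
stripped = strip_ids(rows)
print(stripped)               # ['YQ==', 'Yg==', '', 'ZA==']
print(restore_ids(stripped))  # {'YQ==': 0, 'Yg==': 1, 'ZA==': 3}
```

A comment line would work the same way as the empty line, as long as the reader skips it without shifting the line count.
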

tarekgh commented 3 months ago

> For this could we insert an extra empty line (or a comment line) to make it work?

I tried your idea and it worked fine. Thanks for the suggestion.

| File | Compressed Size |
|---|---|
| cl100k_base.tiktoken.deflate | 513633 bytes |
| gpt2.tiktoken.deflate | 237449 bytes |
| p50k_base.tiktoken.deflate | 237527 bytes |
| r50k_base.tiktoken.deflate | 237449 bytes |

stephentoub commented 3 months ago

Do we have sufficient test coverage, given the changes being made around vocab files? Should we add any tests like those in https://github.com/openai/tiktoken/pull/237/files ?

tarekgh commented 3 months ago

@stephentoub

> Do we have sufficient test coverage, given the changes being made around vocab files? Should we add any tests like those in https://github.com/openai/tiktoken/pull/237/files ?

We already added a test case that loads the data from the embedded file and compares it to the data loaded from the actual files (which have no modifications). Look at the method TestTokenizerUsingExternalVocab in the Tiktoken test file.