Closed by tarekgh 3 months ago
CC @luisquintanilla @michaelgsharp
Attention: Patch coverage is 61.93548%, with 59 lines in your changes missing coverage. Please review.
Project coverage is 68.44%. Comparing base (c980eaf) to head (d73f957).
I had a look at the format of these files: every row lists the token id, which appears to equal the row's 0-based line number. I bet we could further reduce their size by omitting the id and inferring it from the line number when absent.
I found one case in the file p50k_base.tiktoken that breaks this rule; I couldn't find a similar case in the rest of the files. I can apply the optimization to the other files and see how much it saves.
```
IGdhemVk 50255
ICA= 50257
ICAg 50258
```
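The redundancy check above can be sketched as follows (a Python illustration, not the actual dotnet/machinelearning code; the simplified assumption is that each `.tiktoken` row is `base64token id`):

```python
# Sketch: find rows whose id does NOT equal the 0-based line number.
# If this returns an empty list, the id column is redundant and can be
# dropped entirely; otherwise those rows are the exceptions.
def redundant_ids(lines):
    """Return (line_number, id) pairs where id != line number."""
    mismatches = []
    for lineno, line in enumerate(lines):
        token_b64, token_id = line.split()
        if int(token_id) != lineno:
            mismatches.append((lineno, int(token_id)))
    return mismatches

# Hypothetical rows mirroring the p50k_base gap: id 2 is skipped.
rows = ["IGdhemVk 0", "ICA= 1", "ICAg 3"]
print(redundant_ids(rows))  # -> [(2, 3)]
```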
The new sizes are:
| File | Compressed Size |
|---|---|
| cl100k_base.tiktoken.deflate | 513633 bytes |
| gpt2.tiktoken.deflate | 237449 bytes |
| p50k_base.tiktoken.deflate | 370930 bytes |
| r50k_base.tiktoken.deflate | 237449 bytes |
That saves around 0.5 MB more after getting rid of the ids.
> IGdhemVk 50255
> ICA= 50257
> ICAg 50258

For this, could we insert an extra empty line (or a comment line) to make it work?
> For this could we insert an extra empty line (or a comment line) to make it work?

I tried your idea and it worked fine. Thanks for the suggestion.
| File | Compressed Size |
|---|---|
| cl100k_base.tiktoken.deflate | 513633 bytes |
| gpt2.tiktoken.deflate | 237449 bytes |
| p50k_base.tiktoken.deflate | 237527 bytes |
| r50k_base.tiktoken.deflate | 237449 bytes |
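The "extra empty line" trick can be sketched like this (a Python illustration under the same simplified-format assumption, not the actual implementation): when stripping ids, emit one blank line per skipped id so the 0-based line number still equals the token id, and the transformation stays lossless.

```python
# Sketch: strip redundant ids, padding gaps with blank lines.
def strip_ids(rows):
    """rows: list of (base64_token, id), sorted by id -> id-less lines."""
    out = []
    next_id = 0
    for token, token_id in rows:
        out.extend([""] * (token_id - next_id))  # blank line per skipped id
        out.append(token)
        next_id = token_id + 1
    return out

def restore_ids(lines):
    """Invert strip_ids: the line number is the id; blank lines are gaps."""
    return [(line, i) for i, line in enumerate(lines) if line]

rows = [("ICA=", 0), ("ICAg", 2)]  # hypothetical rows; id 1 is missing
lines = strip_ids(rows)            # -> ["ICA=", "", "ICAg"]
assert restore_ids(lines) == rows  # round-trips losslessly
```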
Do we have sufficient test coverage, given the changes being made around vocab files? Should we add any tests like those in https://github.com/openai/tiktoken/pull/237/files ?
@stephentoub
> Do we have sufficient test coverage, given the changes being made around vocab files? Should we add any tests like those in https://github.com/openai/tiktoken/pull/237/files ?
We already added a test case which loads the data from the embedded file and compares it to the data loaded from the actual (unmodified) files. Look at the Tiktoken test file, in the method TestTokenizerUsingExternalVocab.
Fixes https://github.com/dotnet/machinelearning/issues/7095
This change embeds the Tiktoken tokenizer data files to avoid downloads at runtime. The files are embedded in compressed form to reduce size, and the data is decompressed at runtime.