Tokenizer's APIs Update

tarekgh commented 7 months ago

Updating the Tokenizer's APIs:

- Simplifying the APIs by merging the Model abstraction into the abstract Tokenizer class.
- Adding overloads that work with spans.
- Cleaning up, optimizing, and adding more tests.

tarekgh commented 7 months ago

@michaelgsharp please have a look as I have changed the tokenizer APIs and updated the TorchSharp code to work with the new APIs. Thanks!

codecov[bot] commented 7 months ago

Codecov Report

Attention: Patch coverage is 83.85650%, with 216 lines in your changes missing coverage. Please review.

Project coverage is 68.55%. Comparing base (07eb681) to head (53afe94). Report is 3 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7128      +/-   ##
==========================================
+ Coverage   68.48%   68.55%   +0.06%
==========================================
  Files        1262     1259       -3
  Lines      255113   255844     +731
  Branches    26364    26434      +70
==========================================
+ Hits       174722   175382     +660
- Misses      73682    73724      +42
- Partials     6709     6738      +29
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.55% <83.85%> (+0.06%)` | :arrow_up: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `62.89% <73.94%> (+0.02%)` | :arrow_up: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.70% <96.88%> (+0.09%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [...L.Tokenizers/Normalizer/SentencePieceNormalizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FNormalizer%2FSentencePieceNormalizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL05vcm1hbGl6ZXIvU2VudGVuY2VQaWVjZU5vcm1hbGl6ZXIuY3M=) | `82.89% <100.00%> (ø)` | |
| [...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FPreTokenizer%2FPreTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9QcmVUb2tlbml6ZXIuY3M=) | `100.00% <100.00%> (+3.12%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Token.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FToken.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuLmNz) | `100.00% <100.00%> (ø)` | |
| [src/Microsoft.ML.Tokenizers/Utils/PriorityQueue.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FUtils%2FPriorityQueue.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL1ByaW9yaXR5UXVldWUuY3M=) | `61.66% <100.00%> (+0.64%)` | :arrow_up: |
| [...ft.ML.TorchSharp/Extensions/TokenizerExtensions.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128?src=pr&el=tree&filepath=src%2FMicrosoft.ML.TorchSharp%2FExtensions%2FTokenizerExtensions.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub3JjaFNoYXJwL0V4dGVuc2lvbnMvVG9rZW5pemVyRXh0ZW5zaW9ucy5jcw==) | `86.95% <100.00%> (-0.55%)` | :arrow_down: |
| [src/Microsoft.ML.TorchSharp/NasBert/NerTrainer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128?src=pr&el=tree&filepath=src%2FMicrosoft.ML.TorchSharp%2FNasBert%2FNerTrainer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub3JjaFNoYXJwL05hc0JlcnQvTmVyVHJhaW5lci5jcw==) | `91.10% <100.00%> (ø)` | |
| [src/Microsoft.ML.TorchSharp/Roberta/QATrainer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128?src=pr&el=tree&filepath=src%2FMicrosoft.ML.TorchSharp%2FRoberta%2FQATrainer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub3JjaFNoYXJwL1JvYmVydGEvUUFUcmFpbmVyLmNz) | `78.52% <100.00%> (ø)` | |
| [test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FBpeTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9CcGVUZXN0cy5jcw==) | `100.00% <100.00%> (+8.69%)` | :arrow_up: |
| [...crosoft.ML.Tokenizers.Tests/EnglishRobertaTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FEnglishRobertaTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9FbmdsaXNoUm9iZXJ0YVRlc3RzLmNz) | `97.40% <100.00%> (+1.68%)` | :arrow_up: |
| [test/Microsoft.ML.Tokenizers.Tests/LlamaTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FLlamaTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9MbGFtYVRlc3RzLmNz) | `100.00% <100.00%> (ø)` | |
| ... and [15 more](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | |

... and [2 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7128/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)

tarekgh commented 7 months ago

I addressed the feedback and did some more optimization too.

tarekgh commented 7 months ago

@ericstj @michaelgsharp please let me know if you have any more feedback or we are good to go. Thanks!

