Introducing Llama Tokenizer

tarekgh commented 3 months ago

This change introduces the Llama tokenizer, which is implemented as a SentencePiece BPE model based on https://github.com/google/sentencepiece.

tarekgh commented 3 months ago

CC @stephentoub @ericstj @luisquintanilla @michaelgsharp @LittleLittleCloud

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 69.70443%, with 369 lines in your changes missing coverage. Please review.

Project coverage is 68.48%. Comparing base (8b483f4) to head (130583b). Report is 2 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7078      +/-   ##
==========================================
- Coverage   68.82%   68.48%   -0.35%
==========================================
  Files        1255     1262       +7
  Lines      250358   254263    +3905
  Branches    25550    26236     +686
==========================================
+ Hits       172310   174121    +1811
- Misses      71438    73463    +2025
- Partials     6610     6679      +69
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.48% <69.70%> (-0.35%)` | :arrow_down: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `62.88% <61.88%> (-0.38%)` | :arrow_down: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `88.56% <100.00%> (+0.05%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [...kenizers/PreTokenizer/SentencePiecePreTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9TZW50ZW5jZVBpZWNlUHJlVG9rZW5pemVyLmNz) | `100.00% <100.00%> (ø)` | |
| [src/Microsoft.ML.Tokenizers/SentencepieceModel.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1NlbnRlbmNlcGllY2VNb2RlbC5jcw==) | `20.38% <ø> (ø)` | |
| [test/Microsoft.ML.Tokenizers.Tests/LlamaTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9MbGFtYVRlc3RzLmNz) | `100.00% <100.00%> (ø)` | |
| [test/Microsoft.ML.Tokenizers.Tests/TitokenTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9UaXRva2VuVGVzdHMuY3M=) | `100.00% <100.00%> (ø)` | |
| [...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL1N0cmluZ1NwYW5PcmRpbmFsS2V5LmNz) | `80.70% <0.00%> (-1.45%)` | :arrow_down: |
| [...crosoft.ML.Tokenizers/Utils/Helpers.netstandard.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL0hlbHBlcnMubmV0c3RhbmRhcmQuY3M=) | `86.36% <90.90%> (+5.11%)` | :arrow_up: |
| [src/Microsoft.ML.Tokenizers/Utils/Helpers.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL0hlbHBlcnMuY3M=) | `0.00% <0.00%> (ø)` | |
| [src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1Rpa3Rva2VuLmNz) | `70.00% <54.54%> (+2.27%)` | :arrow_up: |
| [...rosoft.ML.Tokenizers/Normalizer/LlamaNormalizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL05vcm1hbGl6ZXIvTGxhbWFOb3JtYWxpemVyLmNz) | `71.23% <71.23%> (ø)` | |
| [src/Microsoft.ML.Tokenizers/Tokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1Rva2VuaXplci5jcw==) | `76.42% <49.20%> (-7.99%)` | :arrow_down: |
| ... and [2 more](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | | |

... and [7 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7078/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)

tarekgh commented 3 months ago

@agocke any idea why build analysis is failing while all reported issues are known issues? I can force merge without waiting but I want to know if there is anything wrong here.

CC @ericstj

agocke commented 3 months ago

Seems like it should be passing to me. @AlitzelMendez any ideas?

ericstj commented 3 months ago

I think this is due to a problem with how machinelearning reports helix failures. https://github.com/dotnet/machinelearning/issues/7044

I noticed it but haven't gotten around to fixing it yet.

Edit: Maybe not since if that were the case I would have expected at least one thing showing up as not "known". Here everything is known. Seems like an issue with Build Analysis

tarekgh commented 3 months ago

Ok, I'll go ahead and force merge.

tarekgh commented 3 months ago

> Maybe not since if that were the case I would have expected at least one thing showing up as not "known". Here everything is known. Seems like an issue with Build Analysis

I notice that Build Analysis points at only one failed leg, and that one is a known issue. Nothing seems to mention the rest of the failed legs.

ericstj commented 3 months ago

I think I realize what's going on here. I just connected this to another issue we were talking about in chat. This build has many legs that were cancelled. I suspect that's why Build Analysis is staying red (it would be nice if that were made clear in the UI). If legs are cancelled, tests may not have even run, so it's not safe to treat the PR as passing. It looks to me like many of the build legs for this build timed out while waiting for Helix to run the work items. Some look like they eventually completed; others are still "waiting".

AlitzelMendez commented 3 months ago

> Seems like it should be passing to me. @AlitzelMendez any ideas?

Hi Andy, this is an opt-in feature that is not activated for this repository. Do we want to add this repository?

ericstj commented 3 months ago

@AlitzelMendez my team owns this repo - can you clarify what you mean by opt-in feature? What feature is opt-in?

