dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Final tokenizer's cleanup #7291

Status: Closed (tarekgh closed this 1 week ago)

codecov[bot] commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 79.38312% with 127 lines in your changes missing coverage. Please review.

Project coverage is 68.84%. Comparing base (5c50319) to head (31b97b8). Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/Microsoft.ML.Tokenizers/Model/BertTokenizer.cs | 57.30% | 48 Missing and 28 partials :warning: |
| .../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs | 67.24% | 10 Missing and 9 partials :warning: |
| ...icrosoft.ML.Tokenizers/Model/WordPieceTokenizer.cs | 60.00% | 7 Missing and 5 partials :warning: |
| ...soft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | 85.71% | 4 Missing and 1 partial :warning: |
| src/Microsoft.ML.Tokenizers/Model/Phi2Tokenizer.cs | 0.00% | 4 Missing :warning: |
| src/Microsoft.ML.Tokenizers/Tokenizer.cs | 75.00% | 4 Missing :warning: |
| src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs | 89.65% | 2 Missing and 1 partial :warning: |
| ...oft.ML.Tokenizers/Model/EnglishRobertaTokenizer.cs | 90.90% | 1 Missing :warning: |
| ...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs | 83.33% | 0 Missing and 1 partial :warning: |
| test/Microsoft.ML.Tokenizers.Tests/CodeGenTests.cs | 98.93% | 0 Missing and 1 partial :warning: |
| ... and 1 more | | |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #7291      +/-   ##
==========================================
- Coverage   68.87%   68.84%   -0.04%
==========================================
  Files        1467     1473       +6
  Lines      273954   274159     +205
  Branches    28380    28420      +40
==========================================
+ Hits       188685   188737      +52
- Misses      77961    78112     +151
- Partials     7308     7310       +2
```

| [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.84% <79.38%> (-0.04%)` | :arrow_down: |
| [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.29% <69.21%> (-0.04%)` | :arrow_down: |
| [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `89.18% <99.04%> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files with missing lines](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | |
|---|---|---|
| [src/Microsoft.ML.Tokenizers/Model/BertOptions.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FBertOptions.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0JlcnRPcHRpb25zLmNz) | `100.00% <100.00%> (ø)` | |
| [...rc/Microsoft.ML.Tokenizers/Model/LlamaTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FLlamaTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0xsYW1hVG9rZW5pemVyLmNz) | `59.09% <ø> (ø)` | |
| [...Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FTiktokenTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1Rpa3Rva2VuVG9rZW5pemVyLmNz) | `78.28% <100.00%> (ø)` | |
| [.../Microsoft.ML.Tokenizers/Model/WordPieceOptions.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FWordPieceOptions.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL1dvcmRQaWVjZU9wdGlvbnMuY3M=) | `100.00% <100.00%> (ø)` | |
| [...crosoft.ML.Tokenizers/Normalizer/BertNormalizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FNormalizer%2FBertNormalizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL05vcm1hbGl6ZXIvQmVydE5vcm1hbGl6ZXIuY3M=) | `62.85% <100.00%> (ø)` | |
| [...ft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FPreTokenizer%2FRegexPreTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1ByZVRva2VuaXplci9SZWdleFByZVRva2VuaXplci5jcw==) | `87.23% <100.00%> (ø)` | |
| [src/Microsoft.ML.TorchSharp/NasBert/NerTrainer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291?src=pr&el=tree&filepath=src%2FMicrosoft.ML.TorchSharp%2FNasBert%2FNerTrainer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub3JjaFNoYXJwL05hc0JlcnQvTmVyVHJhaW5lci5jcw==) | `91.10% <100.00%> (ø)` | |
| [...icrosoft.ML.Tokenizers.Tests/BertTokenizerTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FBertTokenizerTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9CZXJ0VG9rZW5pemVyVGVzdHMuY3M=) | `100.00% <100.00%> (ø)` | |
| [test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FBpeTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9CcGVUZXN0cy5jcw==) | `100.00% <100.00%> (ø)` | |
| [...crosoft.ML.Tokenizers.Tests/EnglishRobertaTests.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291?src=pr&el=tree&filepath=test%2FMicrosoft.ML.Tokenizers.Tests%2FEnglishRobertaTests.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-dGVzdC9NaWNyb3NvZnQuTUwuVG9rZW5pemVycy5UZXN0cy9FbmdsaXNoUm9iZXJ0YVRlc3RzLmNz) | `100.00% <100.00%> (ø)` | |
| ... and [16 more](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | | |

... and [14 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7291/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)

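
For readers checking the numbers in the Codecov report, the percentages can be reproduced from the raw counts. The following is a minimal sketch, assuming coverage is computed as hits / (hits + misses + partials), so partially covered lines count as uncovered; that assumption is consistent with the totals in the report (Hits + Misses + Partials equals Lines) but is not stated in it. The function names are illustrative and not part of any Codecov API.

```python
# Sketch: reproduce the percentages in the Codecov report above.
# Assumption: coverage = hits / (hits + misses + partials), i.e. partially
# covered lines count as uncovered. Function names are illustrative only.


def coverage_percent(hits: int, misses: int, partials: int) -> float:
    """Return line coverage as a percentage of all tracked lines."""
    total = hits + misses + partials
    return 100.0 * hits / total


def patch_line_count(uncovered: int, patch_percent: float) -> int:
    """Infer the number of changed lines from the uncovered count and patch %."""
    return round(uncovered / (1.0 - patch_percent / 100.0))


# Project totals from the coverage diff (base = main, head = #7291).
base = coverage_percent(hits=188685, misses=77961, partials=7308)
head = coverage_percent(hits=188737, misses=78112, partials=7310)

# Patch figures from the report summary: 127 uncovered lines at 79.38312%.
patch_lines = patch_line_count(uncovered=127, patch_percent=79.38312)
patch_hits = patch_lines - 127

print(f"base {base:.2f}%  head {head:.2f}%")
print(f"patch: {patch_hits}/{patch_lines} lines = {100.0 * patch_hits / patch_lines:.5f}%")
```

Under these assumptions the patch touched 616 coverable lines, of which 489 were hit. The per-file "Missing" and "partials" counts listed in the report add up to 126, which leaves one uncovered line for the file hidden behind "... and 1 more".
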
tarekgh commented 1 week ago

/backport to release/4.0

github-actions[bot] commented 1 week ago

Started backporting to release/4.0: https://github.com/dotnet/machinelearning/actions/runs/11747721749

github-actions[bot] commented 1 week ago

@tarekgh backporting to release/4.0 failed, the patch most likely resulted in conflicts:

$ git am --3way --empty=keep --ignore-whitespace --keep-non-patch changes.patch

Applying: Final tokenizer's cleanup
Using index info to reconstruct a base tree...
M   src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
M   src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs
M   src/Microsoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs
M   src/Microsoft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs
M   test/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs
Falling back to patching base and 3-way merge...
Auto-merging test/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs
Auto-merging src/Microsoft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs
CONFLICT (content): Merge conflict in src/Microsoft.ML.Tokenizers/PreTokenizer/RegexPreTokenizer.cs
Auto-merging src/Microsoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs
Auto-merging src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs
Auto-merging src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
CONFLICT (content): Merge conflict in src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 Final tokenizer's cleanup
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

github-actions[bot] commented 1 week ago

@tarekgh an error occurred while backporting to release/4.0, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.