dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.05k stars 1.88k forks source link

Fixes #7271 AOT for ML.Tokenizers #7272

Closed euju-ms closed 1 month ago

euju-ms commented 1 month ago

Fixes https://github.com/dotnet/machinelearning/issues/7271

This PR makes ML.Tokenizers project AOT compatible.

ML.Tokenizers is made to use SourceGenerationContext for deserializing Json.

I had to create a helper class Vocabulary in order to register a JsonConverter for it.

Before the change, we have following Aot warnings on the calls to Json.Deserialize:

  Microsoft.ML.Tokenizers failed with 8 error(s) (2.6s)
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\EnglishRobertaTokenizer.cs(173,25): error IL3050: Using member 'System.Text.Json.JsonSerializer.Deserialize<TValue>(Stream, JsonSerializerOptions)' which has 'RequiresDynamicCodeAttribute' can break functionality when AOT compiling. JSON serialization and deserialization might require types that cannot be statically analyzed and might need runtime code generation. Use System.Text.Json source generation for native AOT applications.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\EnglishRobertaTokenizer.cs(173,25): error IL2026: Using member 'System.Text.Json.JsonSerializer.Deserialize<TValue>(Stream, JsonSerializerOptions)' which has 'RequiresUnreferencedCodeAttribute' can break functionality when trimming application code. JSON serialization and deserialization might require types that cannot be statically analyzed. Use the overload that takes a JsonTypeInfo or JsonSerializerContext, or make sure all of the required types are preserved.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\BPETokenizer.cs(763,59): error IL3050: Using member 'System.Text.Json.JsonSerializer.DeserializeAsync<TValue>(Stream, JsonSerializerOptions, CancellationToken)' which has 'RequiresDynamicCodeAttribute' can break functionality when AOT compiling. JSON serialization and deserialization might require types that cannot be statically analyzed and might need runtime code generation. Use System.Text.Json source generation for native AOT applications.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\BPETokenizer.cs(764,53): error IL3050: Using member 'System.Text.Json.JsonSerializer.Deserialize<TValue>(Stream, JsonSerializerOptions)' which has 'RequiresDynamicCodeAttribute' can break functionality when AOT compiling. JSON serialization and deserialization might require types that cannot be statically analyzed and might need runtime code generation. Use System.Text.Json source generation for native AOT applications.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\BPETokenizer.cs(763,59): error IL2026: Using member 'System.Text.Json.JsonSerializer.DeserializeAsync<TValue>(Stream, JsonSerializerOptions, CancellationToken)' which has 'RequiresUnreferencedCodeAttribute' can break functionality when trimming application code. JSON serialization and deserialization might require types that cannot be statically analyzed. Use the overload that takes a JsonTypeInfo or JsonSerializerContext, or make sure all of the required types are preserved.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\BPETokenizer.cs(764,53): error IL2026: Using member 'System.Text.Json.JsonSerializer.Deserialize<TValue>(Stream, JsonSerializerOptions)' which has 'RequiresUnreferencedCodeAttribute' can break functionality when trimming application code. JSON serialization and deserialization might require types that cannot be statically analyzed. Use the overload that takes a JsonTypeInfo or JsonSerializerContext, or make sure all of the required types are preserved.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\CodeGenTokenizer.cs(1771,25): error IL3050: Using member 'System.Text.Json.JsonSerializer.Deserialize<TValue>(Stream, JsonSerializerOptions)' which has 'RequiresDynamicCodeAttribute' can break functionality when AOT compiling. JSON serialization and deserialization might require types that cannot be statically analyzed and might need runtime code generation. Use System.Text.Json source generation for native AOT applications.
    C:\Users\euju\RiderProjects\ml\src\Microsoft.ML.Tokenizers\Model\CodeGenTokenizer.cs(1771,25): error IL2026: Using member 'System.Text.Json.JsonSerializer.Deserialize<TValue>(Stream, JsonSerializerOptions)' which has 'RequiresUnreferencedCodeAttribute' can break functionality when trimming application code. JSON serialization and deserialization might require types that cannot be statically analyzed. Use the overload that takes a JsonTypeInfo or JsonSerializerContext, or make sure all of the required types are preserved.

After the change, warnings are no more :)

Note that netstandard2.0 framework is not Aot compatible as trimming is only supported for .NET 6 and later. Therefore, in order to test the compatibility for Microsoft.ML.Tokenizers project, you need to set the TargetFramework to net8.0.

Then you can add <PublishAot>true</PublishAot>.

e.g. Microsoft.ML.Tokenizers.csproj

  <PropertyGroup>
    <TargetFramework>net8.0</TargetFramework>
    <Nullable>enable</Nullable>
    <IsPackable>true</IsPackable>
    <PackageDescription>Microsoft.ML.Tokenizers contains the implmentation of the tokenization used in the NLP transforms.</PackageDescription>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
    <PublishAot>true</PublishAot>
  </PropertyGroup>

Then building specifically for Microsoft.ML.Tokenizers should result in warnings if the project is not AOT compatible.

We are excited to review your PR.

So we can do the best job, please check:

euju-ms commented 1 month ago

@dotnet-policy-service agree company="Microsoft"

tarekgh commented 1 month ago

@eiriktsarpalis could you please have a quick look at this change. The change is just changing the json deserialization code to use the source generator.

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 68.81%. Comparing base (823fc17) to head (5483e4f). Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs 50.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #7272 +/- ## ======================================= Coverage 68.81% 68.81% ======================================= Files 1461 1461 Lines 272405 272400 -5 Branches 28176 28176 ======================================= Hits 187442 187442 + Misses 77727 77725 -2 + Partials 7236 7233 -3 ``` | [Flag](https://app.codecov.io/gh/dotnet/machinelearning/pull/7272/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | | |---|---|---| | [Debug](https://app.codecov.io/gh/dotnet/machinelearning/pull/7272/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `68.81% <85.71%> (+<0.01%)` | :arrow_up: | | [production](https://app.codecov.io/gh/dotnet/machinelearning/pull/7272/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `63.30% <85.71%> (+<0.01%)` | :arrow_up: | | [test](https://app.codecov.io/gh/dotnet/machinelearning/pull/7272/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | `89.07% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files with missing lines](https://app.codecov.io/gh/dotnet/machinelearning/pull/7272?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet) | Coverage Δ | | |---|---|---| | [src/Microsoft.ML.Tokenizers/Model/BPETokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7272?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FBPETokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0JQRVRva2VuaXplci5jcw==) | `76.75% <100.00%> (-0.04%)` | :arrow_down: | | [.../Microsoft.ML.Tokenizers/Model/CodeGenTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7272?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FCodeGenTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0NvZGVHZW5Ub2tlbml6ZXIuY3M=) | `72.76% <100.00%> (-0.03%)` | :arrow_down: | | [...oft.ML.Tokenizers/Model/EnglishRobertaTokenizer.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7272?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FModel%2FEnglishRobertaTokenizer.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL01vZGVsL0VuZ2xpc2hSb2JlcnRhVG9rZW5pemVyLmNz) | `74.30% <100.00%> (-0.04%)` | :arrow_down: | | [...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs](https://app.codecov.io/gh/dotnet/machinelearning/pull/7272?src=pr&el=tree&filepath=src%2FMicrosoft.ML.Tokenizers%2FUtils%2FStringSpanOrdinalKey.cs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet#diff-c3JjL01pY3Jvc29mdC5NTC5Ub2tlbml6ZXJzL1V0aWxzL1N0cmluZ1NwYW5PcmRpbmFsS2V5LmNz) | `76.08% <50.00%> (-0.51%)` | :arrow_down: | ... and [3 files with indirect coverage changes](https://app.codecov.io/gh/dotnet/machinelearning/pull/7272/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=dotnet)
euju-ms commented 1 month ago

Thanks for the review :)

tarekgh commented 1 month ago

/ba-g unrelated infrastructure failure.