dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

TextFeaturizingEstimator.Options KeepNumbers and KeepPunctuations not exportable to ONNX? #6317

Open jonathanpeppers opened 2 years ago

jonathanpeppers commented 2 years ago

System Information:

<PackageReference Update="Microsoft.ML" Version="2.0.0-preview.22410.1" />
<PackageReference Update="Microsoft.ML.OnnxConverter" Version="0.20.0-preview.22410.1" />

Describe the bug

As seen here: https://github.com/jonathanpeppers/inclusive-code-reviews-ml/pull/29#discussion_r944879120

A pipeline such as:

            var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey("isnegative", "isnegative")
                                      .Append(mlContext.Transforms.Text.FeaturizeText("text_tf", new TextFeaturizingEstimator.Options
                                      {
                                          // NOTE: not exportable to ONNX
                                          KeepNumbers = false,
                                          KeepPunctuations = false,
                                          // NOTE: these work
                                          KeepDiacritics = true,
                                          CaseMode = TextNormalizingEstimator.CaseMode.Lower,
                                      }, "text"))

Hits an exception such as:

Unhandled exception. System.Collections.Generic.KeyNotFoundException: The given key 'text_TextNormalizer' was not present in the dictionary.
   at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
   at Microsoft.ML.Model.OnnxConverter.OnnxContextImpl.GetVariableName(String colName)
   at Microsoft.ML.Transforms.Text.WordTokenizingTransformer.Mapper.SaveAsOnnx(OnnxContext ctx)
   at Microsoft.ML.Data.RowToRowMapperTransform.Microsoft.ML.Model.OnnxConverter.ISaveAsOnnx.SaveAsOnnx(OnnxContext ctx)
   at Microsoft.ML.Model.OnnxConverter.SaveOnnxCommand.ConvertTransformListToOnnxModel(OnnxContextImpl ctx, IChannel ch, IDataView inputData, IDataView outputData, LinkedList`1 transforms, HashSet`1 inputColumnNamesToDrop, HashSet`1 outputColumnNamesToDrop)
   at Microsoft.ML.OnnxExportExtensions.ConvertToOnnxProtobufCore(IHostEnvironment env, OnnxContextImpl ctx, ITransformer transform, IDataView inputData, String[] outputColumnNamesToKeep)
   at Microsoft.ML.OnnxExportExtensions.ConvertToOnnxProtobuf(ModelOperationsCatalog catalog, ITransformer transfor
   at Microsoft.ML.OnnxExportExtensions.ConvertToOnnx(ModelOperationsCatalog catalog, ITransformer transform, IDataView inputData, Stream stream)
   at InclusiveCodeReviews.ConsoleApp.ModelBuilder.SaveModel(MLContext mlContext, IDataView dataView, ITransformer mlModel, String modelRelativePath, DataViewSchema modelInputSchema) in C:\src\inclusive-code-reviews-ml\ml.net\InclusiveCodeReviews.ConsoleApp\ModelBuilder.cs:line 102
   at InclusiveCodeReviews.ConsoleApp.ModelBuilder.CreateModel() in C:\src\inclusive-code-reviews-ml\ml.net\InclusiveCodeReviews.ConsoleApp\ModelBuilder.cs:line 49
   at Program.<Main>$(String[] args) in C:\src\inclusive-code-reviews-ml\ml.net\InclusiveCodeReviews.Convert\Program.cs:line 4

To Reproduce

Steps to reproduce the behavior:

  1. Run this project:

https://github.com/jonathanpeppers/inclusive-code-reviews-ml/tree/main/ml.net/InclusiveCodeReviews.Convert

  2. Uncomment these two lines:

https://github.com/jonathanpeppers/inclusive-code-reviews-ml/blob/486f7737174702233825ceddf28adb5cc7912f43/ml.net/InclusiveCodeReviews.ConsoleApp/ModelBuilder.cs#L59-L61
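For anyone who would rather not clone the project, the steps above can probably be condensed into a self-contained console sketch along these lines. It assumes the two packages listed under System Information are referenced. The `Review` class, its sample rows, and the use of a `MemoryStream` are hypothetical stand-ins for the project's real data and output file, and I have not confirmed that this reduced pipeline hits exactly the same exception.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

var mlContext = new MLContext(seed: 0);

// Hypothetical in-memory rows standing in for the project's training data.
var rows = new List<Review>
{
    new Review { text = "This is great, thanks!", isnegative = "0" },
    new Review { text = "This is the 3rd time you broke it.", isnegative = "1" },
};
IDataView data = mlContext.Data.LoadFromEnumerable(rows);

// Same featurization options as the pipeline in the report.
var pipeline = mlContext.Transforms.Conversion.MapValueToKey("isnegative", "isnegative")
    .Append(mlContext.Transforms.Text.FeaturizeText("text_tf", new TextFeaturizingEstimator.Options
    {
        KeepNumbers = false,        // reported as not exportable
        KeepPunctuations = false,   // reported as not exportable
        KeepDiacritics = true,
        CaseMode = TextNormalizingEstimator.CaseMode.Lower,
    }, "text"));

ITransformer model = pipeline.Fit(data);

try
{
    // ConvertToOnnx comes from the Microsoft.ML.OnnxConverter package.
    using var stream = new MemoryStream();
    mlContext.Model.ConvertToOnnx(model, data, stream);
    Console.WriteLine($"Export succeeded ({stream.Length} bytes).");
}
catch (KeyNotFoundException ex)
{
    // The failure mode described in this issue.
    Console.WriteLine($"Export failed: {ex.Message}");
}

// Hypothetical input type; column names match the pipeline above.
public class Review
{
    public string text { get; set; } = "";
    public string isnegative { get; set; } = "";
}
```

Setting `KeepNumbers` and `KeepPunctuations` back to their defaults (`true`) should make it easy to compare the failing and working configurations.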

Expected behavior

Exporting to ONNX should succeed when KeepNumbers or KeepPunctuations is set to false. In particular, we want to use KeepPunctuations=false and export to ONNX.

Screenshots, Code, Sample Projects

See above.

luisquintanilla commented 1 year ago

@michaelgsharp I think I remember you working on something related to this. Did those changes affect this issue?