dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Troubles with CustomMappingEstimator after save/load because InputSchemaDefinition is not being saved. #3988

Open SkinnyMan32 opened 5 years ago

SkinnyMan32 commented 5 years ago

System information

Issue

I am trying to use CustomMapping with custom column names. It works fine, but after saving and loading the model I get an exception: System.ArgumentOutOfRangeException: "Could not find column 'Words'". 'Words' is the original property name, not the column name I specified.

What am I doing wrong?

I can work around it by using only the original property names, but that is inconvenient in some cases.

Source code

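// Namespaces used below. EnglishStemmer comes from a third-party stemming
// library whose namespace is not shown in the original post.
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms;

// --- Test method ---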
var ml = new MLContext();
var descriptions = new[]
{
    new { Description = "Painted, Painting, Painter" }
};
var dataView = ml.Data.LoadFromEnumerable(descriptions);

var pipeline = ml.Transforms.Text.NormalizeText("Normalized", "Description")
    .Append(ml.Transforms.Text.TokenizeIntoWords("Tokens", "Normalized"))
    // (Extension method) CustomMapping with specified column names
    .Append(ml.Transforms.StemText("Stemmed", "Tokens"));

var model = pipeline.Fit(dataView);
var preview = model.Transform(dataView).Preview();  // everything is ok

// Save model
MemoryStream stream = new MemoryStream();
ml.Model.Save(model, dataView.Schema, stream);
stream.Position = 0;

// Load model in the new context
var ml2 = new MLContext();
// Register custom action
ml2.ComponentCatalog.RegisterAssembly(typeof(StemmerCustomAction).Assembly);
var loadedModel = ml2.Model.Load(stream, out var schema);

// Exception:
// System.ArgumentOutOfRangeException: "Could not find column 'Words'"
var preview2 = loadedModel.Transform(dataView).Preview();

//--- Classes ---

public class StemmerInput
{
    public string[] Words { get; set; }
}

public class StemmerOutput
{
    public string[] Stemmed { get; set; }
}

[CustomMappingFactoryAttribute("StemText")]
public class StemmerCustomAction : CustomMappingFactory<StemmerInput, StemmerOutput>
{
    public static void StemAction(StemmerInput input, StemmerOutput output)
    {
        var stemmer = new EnglishStemmer();
        output.Stemmed = new string[input.Words.Length];
        for (int i = 0; i < input.Words.Length; i++)
        {
            output.Stemmed[i] = stemmer.Stem(input.Words[i]);
        }
    }

    public override Action<StemmerInput, StemmerOutput> GetMapping() => StemAction;
}

static class StemmerTransformHelper
{
    public static CustomMappingEstimator<StemmerInput, StemmerOutput> StemText(this TransformsCatalog catalog,
        string outputColumnName, string inputColumnName = null)
    {
        var inputSchema = SchemaDefinition.Create(typeof(StemmerInput), SchemaDefinition.Direction.Read);
        var outSchema = SchemaDefinition.Create(typeof(StemmerOutput), SchemaDefinition.Direction.Write);
        // specify column names
        inputSchema[0].ColumnName = inputColumnName ?? outputColumnName;
        outSchema[0].ColumnName = outputColumnName;
        return catalog.CustomMapping(new StemmerCustomAction().GetMapping(), "StemText", inputSchema, outSchema);
    }
}

antoniovs1029 commented 4 years ago

What am I doing wrong?

I believe you are not doing anything wrong, but unfortunately I also believe that it is currently not possible to do what you want (i.e. use a loaded CustomMappingTransformer whose InputSchemaDefinition originally had a ColumnName different from its MemberName).

The problem is that when a CustomMappingTransformer is saved, only its contractName is written; the InputSchemaDefinition and AddedSchema are not. All the information in those SchemaDefinitions is therefore lost. In your particular case, when the CustomMappingTransformer is loaded back, the ColumnName of the InputSchemaDefinition defaults to the MemberName "Words" instead of the custom "Tokens" name you passed to your StemText() extension method.

Moreover, the current implementation of SchemaDefinition doesn't include methods to save objects of that class, and there are currently no plans to implement them. So I am sad to say there's no immediate solution to the bug you've found.

On the other hand, as you probably know, there are workarounds (I will still explain them here in case somebody else finds them useful in the future). The only way to make this work is to have a MemberName that matches the ColumnName from the beginning, so you could:

  1. Either rename your Words member in StemmerInput to Tokens

  2. Or change your "Tokens" strings in your pipeline definition to be "Words"

var pipeline = ml.Transforms.Text.NormalizeText("Normalized", "Description")
    .Append(ml.Transforms.Text.TokenizeIntoWords("Words", "Normalized"))
    .Append(ml.Transforms.StemText("Stemmed", "Words"));
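
For workaround 1, a minimal sketch of the renamed input class could look like the following. It only illustrates the rename: the Stem method is a hypothetical placeholder for the EnglishStemmer from the original post, and the ML.NET factory class is left out so the snippet compiles on its own. In the real code, StemmerCustomAction.StemAction would read input.Tokens instead of input.Words.

```csharp
using System;

// Workaround 1: the member name now equals the pipeline column name "Tokens".
public class StemmerInput
{
    public string[] Tokens { get; set; }
}

public class StemmerOutput
{
    public string[] Stemmed { get; set; }
}

public static class StemmerMapping
{
    // Hypothetical placeholder for EnglishStemmer.Stem; it returns the word unchanged.
    private static string Stem(string word) => word;

    // Same shape as StemAction in the original post, reading Tokens instead of Words.
    public static void StemAction(StemmerInput input, StemmerOutput output)
    {
        output.Stemmed = new string[input.Tokens.Length];
        for (int i = 0; i < input.Tokens.Length; i++)
        {
            output.Stemmed[i] = Stem(input.Tokens[i]);
        }
    }
}
```

With the member named Tokens, the original pipeline can keep "Tokens" as the column name, assuming the loaded transformer falls back to member names as described above.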

antoniovs1029 commented 4 years ago

So even though a workaround is known for this issue, the underlying problem of not serializing the SchemaDefinition objects remains, and it is a bug that should be fixed eventually.

Furthermore, if PR #4676 gets merged to add the StatefulCustomMappingTransformer, then this issue might also affect that.

I am also adding the "Breaking API change" label to this, in case the problem of serializing SchemaDefinitions isn't fixed before a new major ML.NET release gets planned. It might be worth exploring the possibility of removing the SchemaDefinition inputSchemaDefinition = null, SchemaDefinition outputSchemaDefinition = null parameters from the CustomMapping method, so that users don't create these SchemaDefinitions and run into this issue down the road.