dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

AutoML Exception: System.ArgumentOutOfRangeException: 'Could not find input column 'SamplingKeyColumn' (Parameter 'inputSchema')' #5256

Closed · ladislav-dolezal closed this issue 4 years ago

ladislav-dolezal commented 4 years ago

Hi, I get an exception on prediction with AutoML. Before running the sample, you need to reference two NuGet packages: Microsoft.ML and Microsoft.ML.AutoML. Here is the complete code to reproduce the error (run in VS2019):

using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers.FastTree;
using System;
using System.Collections.Generic;
using System.Linq;
using static Microsoft.ML.DataOperationsCatalog;

namespace AutoML
{
    class Program
    {
        static void Main(string[] args)
        {
            var mlContext = new MLContext(seed: 0);

            var examples = GenerateData(100);

            var dataview = mlContext.Data.LoadFromEnumerable(examples);

            TrainTestData trainTestSplit = mlContext.Data.TrainTestSplit(dataview, testFraction: 0.1, samplingKeyColumnName: null);
            IDataView trainingData = trainTestSplit.TrainSet;
            IDataView testData = trainTestSplit.TestSet;

            ITransformer model = TrainRegresionAutoML(trainingData);
            ReportOnFeatureImportance(mlContext, model, dataview);            

            OutputData prediction = PredictRegresinAutoML<InputData,OutputData>(model, new InputData(){A = 6, B = 6});          
        }

        static ITransformer TrainRegresionAutoML(IDataView trainData)
        {
            var mlContext = new MLContext(seed: 0);

            var settings = new RegressionExperimentSettings
            {
                MaxExperimentTimeInSeconds = 10, // in seconds
                OptimizingMetric = RegressionMetric.RSquared,
                CacheDirectory = null
            };

            var experiment = mlContext.Auto().CreateRegressionExperiment(settings);

            var model = experiment.Execute(trainData);            

            return model.BestRun.Model;
        }

        private static void ReportOnFeatureImportance(MLContext context, ITransformer model, IDataView data)
        {            
            // Need to cast from the ITransformer interface to gain access to the LastTransformer property.
            var typedModel = (TransformerChain<IPredictionTransformer<object>>)model;
            var modelParams = typedModel.LastTransformer.Model as FastTreeRegressionModelParameters;
            var weights = new VBuffer<float>();
            modelParams.GetFeatureWeights(ref weights);            
        }

        static TDst PredictRegresinAutoML<TSrc, TDst>(ITransformer model, TSrc inputData) 
            where TSrc : class
            where TDst : class, new()
        {
            var mlContext = new MLContext(seed: 0);

            var predictor = mlContext.Model.CreatePredictionEngine<TSrc, TDst>(model);
            return predictor.Predict(inputData);
        }      

        private static IEnumerable<InputData> GenerateData(int count,
            int seed = 0)

        {
            for (int i = 0; i < count; i++)
            {
                for (int ii = 0; ii < count; ii++)
                {
                    yield return new InputData
                    {
                        A = i,
                        B = ii,
                        Value = i * ii
                    };
                }
            }
        }       
    }

    public class InputData
    {     
        public float A { get; set; }

        public float B { get; set; }

        [ColumnName("Label")]
        public float Value { get; set; }
    }

    public class OutputData
    {
        [ColumnName("Score")]
        public float Result;
    }

    public class FeatureImportance
    {
        public string Name { get; set; }

        public double RSquaredMean { get; set; }

        public double CorrelationCoefficient { get; set; }
    }
}
antoniovs1029 commented 4 years ago

This seems to be a duplicate of https://github.com/dotnet/machinelearning/issues/4048, but that issue was closed because we never received a dataset to reproduce it, so I think it's fine to keep this new issue open.

Still, looking at what users mentioned there, it seems there's actually a bug in TrainTestSplit (and maybe CrossValidationSplit) when used with AutoML.

EDIT: I had asked for a sample dataset to reproduce the issue, but I've now realized that your sample code generates the dataset, so there's no need for that. I was able to reproduce the issue, and this seems like a bug. Will investigate now.

antoniovs1029 commented 4 years ago

Hi, @ladodc . So this is a bug in ML.NET's mlContext.Data.TrainTestSplit() method, and I'll work on a fix. In the meantime, there are two main ways to avoid this exception:

Workaround #1: Don't split the dataset before running the AutoML experiment

You can simply use:


var dataview = mlContext.Data.LoadFromEnumerable(examples);
var model = TrainRegresionAutoML(dataview); // pass the loaded data without splitting.

Notice that this is valid because the experiment.Execute(trainingData) call inside TrainRegresionAutoML will split the trainingData DataView itself (if the DataView has fewer than 15,000 rows, it is split into 2 sets; if it has more than 15,000 rows, it is split into 10 folds). So there was no need to split dataview in the first place, although it is still a bug that you get this exception when getting predictions after using mlContext.Data.TrainTestSplit. By the way, notice that there are more Execute() overloads here, so you can choose one based on how you want to split the data 😄 I believe none of these methods will throw the exception you're getting now.
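For instance, a minimal sketch of using the overload that accepts a separate validation set (assuming the Execute(trainData, validationData) overload available in your AutoML version; untested here, and parameter names may differ between releases):

```csharp
// Sketch: instead of letting AutoML split the data itself, hand it the
// train/test sets you already produced with TrainTestSplit. With this
// overload AutoML uses your validation set as-is rather than creating
// its own split.
var experiment = mlContext.Auto().CreateRegressionExperiment(settings);
var result = experiment.Execute(trainData: trainingData, validationData: testData);
ITransformer bestModel = result.BestRun.Model;
```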

Workaround #2: Drop the "SamplingKeyColumn" after splitting the data, but before passing it to AutoML for training

            var dataview = mlContext.Data.LoadFromEnumerable(examples);

            TrainTestData trainTestSplit = mlContext.Data.TrainTestSplit(dataview, testFraction: 0.1, samplingKeyColumnName: null);
            IDataView trainingData = trainTestSplit.TrainSet;
            IDataView testData = trainTestSplit.TestSet;

            trainingData = mlContext.Transforms.DropColumns("SamplingKeyColumn").Fit(trainingData).Transform(trainingData);

            ITransformer model = TrainRegresionAutoML(trainingData);

I'll explain why this works in the section below.

PS (somewhat unrelated to your issue): notice that your ReportOnFeatureImportance method will only work if the model created by AutoML happens to be a FastTree regression model; otherwise an exception will be thrown, because the cast as FastTreeRegressionModelParameters won't work as expected. So workaround #1 causes an exception on my computer because of this (simply not calling the method makes everything work, including the prediction), while workaround #2 works even when calling that method (because it happens to return a FastTree model). This isn't a bug in ML.NET; your ReportOnFeatureImportance method is just making an assumption that isn't necessarily true (i.e. that AutoML will return a FastTree model). Whether AutoML returns such a model depends on the training data, how it is split, and the many other parameters used by AutoML. In general, users can't know in advance what kind of model an AutoML experiment will return.
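A minimal defensive sketch of that method (a hypothetical adaptation of the code above, not a change ML.NET requires), which skips the report instead of throwing when the best model isn't a FastTree regressor:

```csharp
// Sketch: check the casts before using them, since AutoML may return
// any regression model type, not necessarily FastTree.
private static void ReportOnFeatureImportance(MLContext context, ITransformer model, IDataView data)
{
    var typedModel = model as TransformerChain<IPredictionTransformer<object>>;
    var modelParams = typedModel?.LastTransformer.Model as FastTreeRegressionModelParameters;
    if (modelParams == null)
    {
        Console.WriteLine("Best model is not a FastTree regressor; skipping feature importance.");
        return;
    }
    var weights = new VBuffer<float>();
    modelParams.GetFeatureWeights(ref weights);
    Console.WriteLine("Feature weights: " + string.Join(", ", weights.DenseValues()));
}
```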

Why this works / The cause of this issue

The mlContext.Data.TrainTestSplit() method (and also the mlContext.Data.CrossValidationSplit() method) creates a column called "SamplingKeyColumn" in here. (In some cases, such as when the original DataView already has a column named "SamplingKeyColumn", the new column gets a name like `temp_SamplingKeyColumn_000`, which is probably what happened in the other issue linked above.) This column is only meant to be used to split the data, but it is never dropped (and I think we should drop it automatically after doing the splits). So trainingData and testData include the automatically created SamplingKeyColumn, which wasn't in the original dataview.

Then, when training the AutoML model with the default parameters (i.e. Execute(trainingData)), AutoML believes the SamplingKeyColumn in trainingData is necessary for the model and includes it in a Concatenate transformer. Later, when you use the PredictionEngine with the trained model, it expects a SamplingKeyColumn, which isn't included in your InputData class, and throws the exception.

So I believe the fix for this issue is simply to change mlContext.Data.TrainTestSplit() and mlContext.Data.CrossValidationSplit() so that they automatically drop the SamplingKeyColumn they create.
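You can see the extra column yourself by inspecting the schema of the split DataView (a small sketch, assuming the split was done as in the original sample):

```csharp
// Sketch: print the column names of the training set produced by
// TrainTestSplit. Alongside A, B, and Label from InputData, the
// automatically created "SamplingKeyColumn" shows up here even though
// it was never declared in the input class.
foreach (var column in trainingData.Schema)
{
    Console.WriteLine(column.Name);
}
```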

ladislav-dolezal commented 4 years ago

Hi Antonio, thanks for your response and your comments. Yes, I like the DropColumns workaround, and it works. The future fix of automatically dropping the column sounds reasonable to me. My problem is solved, so I'm closing this issue. Thank you again for your thorough comments, clarifications, and hints. Great!

antoniovs1029 commented 4 years ago

Hi, @ladodc . I'm glad to hear the workarounds I suggested fixed your problem.

Since the actual issue is still there (i.e. TrainTestSplit() doesn't automatically remove the "SamplingKeyColumn"), I'll reopen this issue to keep track of that problem until it gets fixed. Since it's a small change, I think I'll be able to fix it soon. 😄