This seems to be a duplicate of https://github.com/dotnet/machinelearning/issues/4048, but that issue was closed because we didn't receive any dataset to reproduce it, so I think it's fine to keep this new issue open.
Still, looking at what users mentioned there, it seems there is actually a bug in TrainTestSplit (and maybe CrossValidationSplit) when working with AutoML.
EDIT: I had asked for a sample dataset to repro the issue, but I've now realized that your sample code generates the dataset itself, so there's no need for that. I was able to reproduce the issue, and this does look like a bug. Will investigate now.
Hi, @ladodc. So this is a bug in ML.NET's `mlContext.Data.TrainTestSplit()` method, and I'll work on a solution. In the meantime, there are two main ways you can avoid this exception:

Workaround # 1: don't split the data yourself. You can simply use:
```csharp
var dataview = mlContext.Data.LoadFromEnumerable(examples);
var model = TrainRegresionAutoML(dataview); // pass the loaded data without splitting
```
Notice that this is valid because the `experiment.Execute(trainingData)` call inside `TrainRegresionAutoML` will actually split the `trainingData` DataView that is passed to it (if the DataView has fewer than 15,000 rows, it will split it into 2 sets; if it has more than 15,000 rows, it will split it into 10 folds). So there was no need to split `dataview` in the first place, although it is still a bug that you get that exception when getting predictions after using `mlContext.Data.TrainTestSplit`. By the way, notice that there are more `Execute()` overloads here, so you can choose which one to use based on how you want to split the data 😄 I believe none of those overloads will throw the exception you're getting now.
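For instance, here is a minimal sketch of one of those overloads (assuming `trainingData` and `testData` come from a split like the one in workaround # 2 below, and that your label column is named "Label"); passing an explicit validation set means AutoML won't need to split the training data itself:

```csharp
// Sketch only: the Execute() overload that takes separate train and
// validation sets, so AutoML skips its own internal split.
var result = experiment.Execute(trainData: trainingData,
                                validationData: testData,
                                labelColumnName: "Label");
ITransformer model = result.BestRun.Model;
```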
Workaround # 2: drop the `SamplingKeyColumn` that the split creates before training:

```csharp
var dataview = mlContext.Data.LoadFromEnumerable(examples);
TrainTestData trainTestSplit = mlContext.Data.TrainTestSplit(dataview, testFraction: 0.1, samplingKeyColumnName: null);
IDataView trainingData = trainTestSplit.TrainSet;
IDataView testData = trainTestSplit.TestSet;

// Drop the "SamplingKeyColumn" that TrainTestSplit silently added to the split data.
trainingData = mlContext.Transforms.DropColumns("SamplingKeyColumn").Fit(trainingData).Transform(trainingData);

ITransformer model = TrainRegresionAutoML(trainingData);
```
I'll explain why this works in the section below.
PS (somewhat unrelated to your issue): notice that your `ReportOnFeatureImportance` method will only work if the model created by AutoML happens to be a FastTree regression model; otherwise an exception will be thrown, because the cast to `FastTreeRegressionModelParameters` won't work as expected. So workaround # 1 causes an exception on my computer because of this (simply not calling the method makes everything work, including the prediction), while workaround # 2 works even when calling that method (because it happens to return a FastTree model). This isn't a bug in ML.NET; your `ReportOnFeatureImportance` method is simply making an assumption that isn't necessarily true (i.e., that AutoML will return a FastTree model). Whether AutoML returns such a model depends on the data used for training, how it is split, and the many other parameters used by AutoML; in general, users can't know in advance what kind of model an AutoML experiment will return.
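If you want to keep calling that method regardless of which trainer AutoML picked, a guarded cast is one option. This is a minimal sketch only; the exact shape of the trained pipeline is an assumption, so adjust the transformer types to match your model:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers.FastTree;

// Sketch only: attempt the FastTree cast, and skip feature-importance
// reporting when AutoML returned some other kind of model.
var chain = model as TransformerChain<ITransformer>;
var predictor = chain?.LastTransformer
    as RegressionPredictionTransformer<FastTreeRegressionModelParameters>;
if (predictor != null)
{
    FastTreeRegressionModelParameters fastTree = predictor.Model;
    // ... inspect FastTree-specific parameters here ...
}
else
{
    Console.WriteLine("AutoML did not return a FastTree model; skipping feature importance.");
}
```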
The `mlContext.Data.TrainTestSplit()` method (and actually also the `mlContext.Data.CrossValidationSplit()` method) creates a column called "SamplingKeyColumn" in here (in some cases, such as when the original DataView already has a column named "SamplingKeyColumn", the new column gets a name like "temp_SamplingKeyColumn_000"... which is probably what happened on this other issue). This column is only meant to be used to split the data, but it is never dropped (and I think we should drop it automatically after doing the splits). So `trainingData` and `testData` include the automatically created `SamplingKeyColumn`, which wasn't there in the original `dataview`.
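You can see this extra column by dumping the schema of the split sets. A minimal sketch, reusing `mlContext` and `dataview` from the snippets above:

```csharp
// Sketch only: the train set's schema contains every original column
// plus the "SamplingKeyColumn" added by TrainTestSplit.
var split = mlContext.Data.TrainTestSplit(dataview, testFraction: 0.1);
foreach (var column in split.TrainSet.Schema)
    Console.WriteLine(column.Name);
```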
Then, when training the AutoML model with the default parameters (i.e., `Execute(trainingData)`), AutoML believes that the `SamplingKeyColumn` in `trainingData` is necessary for the model and includes it in a `Concatenate` transformer. When you later use a `PredictionEngine` with your trained model, it expects the `SamplingKeyColumn`, which isn't included in your `InputData` class, and so it throws the exception.
So I believe the solution to this issue is simply changing `mlContext.Data.TrainTestSplit()` and `mlContext.Data.CrossValidationSplit()` to automatically drop the `SamplingKeyColumn` they created.
Hi Antonio, thanks for your response and your comments. Yes, I like the workaround with DropColumns, and it works. The future solution of automatically dropping the column also seems reasonable to me. My problem is solved, so I'm closing this issue. Thank you again for your rich comments, clarifications, and hints. Great!
Hi, @ladodc. I'm glad to hear the workarounds I suggested fixed your problem.
Since the actual issue is still there (i.e., `TrainTestSplit()` doesn't automatically remove the "SamplingKeyColumn"), I'll reopen this issue to keep track of that problem until it gets fixed. Since it's a small change, I think I'll be able to fix it soon, though. 😄
Hi, I get an exception on prediction with AutoML. Before you run the repro, you need to reference two NuGet packages: Microsoft.ML and Microsoft.ML.AutoML. Here is the complete code to reproduce the error; run it in VS2019:
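(The original repro code didn't survive in this copy of the thread; below is a minimal sketch of the kind of program described above. The `InputData` and `TrainRegresionAutoML` names come from the discussion; the property names, generated data, and experiment settings are assumptions.)

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

public class InputData
{
    public float Feature1 { get; set; }
    public float Feature2 { get; set; }
    public float Label { get; set; }
}

public class OutputData
{
    [ColumnName("Score")]
    public float Prediction { get; set; }
}

class Program
{
    static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Generate a synthetic dataset, as the original repro did.
        var random = new Random(0);
        var examples = Enumerable.Range(0, 1000).Select(_ =>
        {
            var f1 = (float)random.NextDouble();
            var f2 = (float)random.NextDouble();
            return new InputData { Feature1 = f1, Feature2 = f2, Label = 2 * f1 + 3 * f2 };
        }).ToList();

        var dataview = mlContext.Data.LoadFromEnumerable(examples);

        // Splitting here leaves the hidden "SamplingKeyColumn" in the train set.
        var split = mlContext.Data.TrainTestSplit(dataview, testFraction: 0.1);
        var model = TrainRegresionAutoML(mlContext, split.TrainSet);

        // With the bug described above, prediction fails here because the model's
        // input schema expects "SamplingKeyColumn", which InputData doesn't have.
        var engine = mlContext.Model.CreatePredictionEngine<InputData, OutputData>(model);
        Console.WriteLine(engine.Predict(examples[0]).Prediction);
    }

    static ITransformer TrainRegresionAutoML(MLContext mlContext, IDataView trainingData)
    {
        var experiment = mlContext.Auto().CreateRegressionExperiment(maxExperimentTimeInSeconds: 30);
        var result = experiment.Execute(trainingData, labelColumnName: "Label");
        return result.BestRun.Model;
    }
}
```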