dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Schema mismatch using AutoML API #6544

Open LanceElCamino opened 1 year ago

LanceElCamino commented 1 year ago

Windows 10, Microsoft.ML 2.0, Microsoft.ML.AutoML 0.20.0

I receive this error:

System.ArgumentOutOfRangeException: 'Schema mismatch for label column 'NextDayClose': expected Single, got Boolean (Parameter 'labelCol')'

when running this code copied and pasted from the AutoML QuickStart sample:

// Import namespaces
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;
using static Microsoft.ML.DataOperationsCatalog;

// Initialize MLContext
MLContext ctx = new MLContext();

// Define data path
var dataPath = Path.GetFullPath(@"CSATest.csv");

// Infer column information
ColumnInferenceResults columnInference =
    ctx.Auto().InferColumns(dataPath, labelColumnName: "NextDayClose", groupColumns: false);

// Create text loader
TextLoader loader = ctx.Data.CreateTextLoader(columnInference.TextLoaderOptions);

// Load data into IDataView
IDataView data = loader.Load(dataPath);

// Split into train (80%), validation (20%) sets
TrainTestData trainValidationData = ctx.Data.TrainTestSplit(data, testFraction: 0.2);

// Define pipeline
SweepablePipeline pipeline =
    ctx.Auto().Featurizer(data, columnInformation: columnInference.ColumnInformation)
        .Append(ctx.Auto().MultiClassification(labelColumnName: columnInference.ColumnInformation.LabelColumnName));

// Create AutoML experiment
AutoMLExperiment experiment = ctx.Auto().CreateExperiment();

// Configure experiment
experiment
    .SetPipeline(pipeline)
    .SetMulticlassClassificationMetric(MulticlassClassificationMetric.MicroAccuracy, labelColumn: columnInference.ColumnInformation.LabelColumnName)
    .SetTrainingTimeInSeconds(10)
    .SetDataset(trainValidationData);

// Log experiment trials
ctx.Log += (_, e) => {
    if (e.Source.Equals("AutoMLExperiment"))
    {
        Console.WriteLine(e.RawMessage);
    }
};

// Run experiment
TrialResult experimentResults = await experiment.RunAsync();

// Get best model
var model = experimentResults.Model;

Attached is the csv file being called.

CSATest.csv

Is there a way to override the data type that InferColumns is inferring? Looks like it's expecting a bool yet it's a single.

LanceElCamino commented 1 year ago

The label column "NextDayClose" was created to have only two values, 1 or 0. Does InferColumns infer the data type as Boolean because the only possible values are 1 and 0? If so, I assume this requires a binary classification scenario instead of multiclass. In Model Builder I can select either scenario for this dataset and it builds the model based on that selection. Is there a way to change the data type that InferColumns infers for a column to whatever we want (Single, DateTime, Boolean, etc.)?

michaelgsharp commented 1 year ago

@LittleLittleCloud

LittleLittleCloud commented 1 year ago

@LanceElCamino Featurizer doesn't do anything to the label column. If you want to use a multiclass trainer on your dataset, you'll need to map the label column to a key type with the MapValueToKey transformer before feeding it to the trainer, because multiclass trainers in ML.NET require the label to be a key type.
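
A minimal sketch of the key-type mapping described above, assuming the same CSATest.csv file and NextDayClose label as in the original post. It appends MapValueToKey between the featurizer and the multiclass trainer so the label reaches the trainer as a key type. Where the mapping step sits in the pipeline, and overwriting the label column in place, are illustrative choices rather than something confirmed in this thread.

```csharp
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;
using static Microsoft.ML.DataOperationsCatalog;

MLContext ctx = new MLContext();
var dataPath = Path.GetFullPath(@"CSATest.csv");

// Infer columns and load the data exactly as in the original post
ColumnInferenceResults columnInference =
    ctx.Auto().InferColumns(dataPath, labelColumnName: "NextDayClose", groupColumns: false);
IDataView data = ctx.Data.CreateTextLoader(columnInference.TextLoaderOptions).Load(dataPath);
TrainTestData trainValidationData = ctx.Data.TrainTestSplit(data, testFraction: 0.2);

string label = columnInference.ColumnInformation.LabelColumnName;

// Featurize the non-label columns, convert the label to a key type in place
// (assumption: overwriting the column is acceptable here), then sweep over
// multiclass trainers.
SweepablePipeline pipeline =
    ctx.Auto().Featurizer(data, columnInformation: columnInference.ColumnInformation)
        .Append(ctx.Transforms.Conversion.MapValueToKey(outputColumnName: label, inputColumnName: label))
        .Append(ctx.Auto().MultiClassification(labelColumnName: label));

AutoMLExperiment experiment = ctx.Auto().CreateExperiment();
experiment
    .SetPipeline(pipeline)
    .SetMulticlassClassificationMetric(MulticlassClassificationMetric.MicroAccuracy, labelColumn: label)
    .SetTrainingTimeInSeconds(10)
    .SetDataset(trainValidationData);

TrialResult result = await experiment.RunAsync();
Console.WriteLine($"Best micro-accuracy: {result.Metric}");
```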

However, if you are certain that your label has only two possible values, I would suggest trying binary classification instead. In that case you would need to change the trainer in your pipeline to BinaryClassification and change the metric type in AutoMLExperiment to BinaryClassificationMetric.
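
A sketch of the binary classification variant, again assuming the CSATest.csv file and NextDayClose label from the original post. Because InferColumns already reads the label as Boolean (per the error message), no label conversion is added. Using Accuracy as the metric is an arbitrary choice for illustration.

```csharp
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;
using static Microsoft.ML.DataOperationsCatalog;

MLContext ctx = new MLContext();
var dataPath = Path.GetFullPath(@"CSATest.csv");

ColumnInferenceResults columnInference =
    ctx.Auto().InferColumns(dataPath, labelColumnName: "NextDayClose", groupColumns: false);
IDataView data = ctx.Data.CreateTextLoader(columnInference.TextLoaderOptions).Load(dataPath);
TrainTestData trainValidationData = ctx.Data.TrainTestSplit(data, testFraction: 0.2);

string label = columnInference.ColumnInformation.LabelColumnName;

// Swap the multiclass trainer for the binary classification trainer
SweepablePipeline pipeline =
    ctx.Auto().Featurizer(data, columnInformation: columnInference.ColumnInformation)
        .Append(ctx.Auto().BinaryClassification(labelColumnName: label));

// Swap the multiclass metric for a binary classification metric
// (Accuracy is chosen here only as an example)
AutoMLExperiment experiment = ctx.Auto().CreateExperiment();
experiment
    .SetPipeline(pipeline)
    .SetBinaryClassificationMetric(BinaryClassificationMetric.Accuracy, labelColumn: label)
    .SetTrainingTimeInSeconds(10)
    .SetDataset(trainValidationData);

TrialResult result = await experiment.RunAsync();
Console.WriteLine($"Best accuracy: {result.Metric}");
```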

LittleLittleCloud commented 1 year ago

> Is there a way to override the data type that InferColumns is inferring? Looks like it's expecting a bool yet it's a single.

Yep. Simply adding or moving the corresponding columns in the result returned by InferColumns will work.
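
One way to read this suggestion, sketched under assumptions: the object returned by InferColumns exposes both the loader columns (TextLoaderOptions.Columns) and the column-purpose lists (ColumnInformation), and both can be edited before the loader and featurizer are created. The feature column name "Volume" below is hypothetical and not taken from the attached CSV. Whether this is exactly what the maintainer had in mind is not confirmed in the thread.

```csharp
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

MLContext ctx = new MLContext();
var dataPath = Path.GetFullPath(@"CSATest.csv");

ColumnInferenceResults columnInference =
    ctx.Auto().InferColumns(dataPath, labelColumnName: "NextDayClose", groupColumns: false);

// Override the inferred data type of the label before building the loader
foreach (TextLoader.Column column in columnInference.TextLoaderOptions.Columns)
{
    if (column.Name == "NextDayClose")
    {
        column.DataKind = DataKind.Single;
    }
}

// Move a column between purpose lists. "Volume" is a hypothetical column name:
// if it was inferred as categorical, treat it as numeric instead.
ColumnInformation info = columnInference.ColumnInformation;
if (info.CategoricalColumnNames.Remove("Volume"))
{
    info.NumericColumnNames.Add("Volume");
}

// The loader now uses the overridden type for the label column
IDataView data = ctx.Data.CreateTextLoader(columnInference.TextLoaderOptions).Load(dataPath);
Console.WriteLine(data.Schema["NextDayClose"].Type);
```

Note that a Single label still needs MapValueToKey before a multiclass trainer, as explained earlier in the thread; changing the loader type on its own only affects how the column is read.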

luisquintanilla commented 1 year ago

Hi @LanceElCamino,

Did the suggestions from @LittleLittleCloud resolve your issue?

ghost commented 1 year ago

This issue has been marked needs-author-action and may be missing some important information.

ghost commented 1 year ago

This issue has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

LanceElCamino commented 1 year ago

Thanks. Binary Classification works for this scenario.