dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Train binary classification with text label #2826

Closed daholste closed 4 years ago

daholste commented 5 years ago

@justinormont points out (https://github.com/dotnet/machinelearning-automl/issues/255) :

Key type is needed for binary classification learners:

  • Dataset w/ text labels (as seen here)
  • Datasets w/ missing labels -- BL no longer supports NA (changed in dotnet/machinelearning#673)

When the "Label" column is text, calling


var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Label");
var trainer = mlContext.BinaryClassification.Trainers.LightGbm(labelColumnName: "Label", featureColumnName: "Features");
var trainingPipeline = pipeline.Append(trainer);
var crossValidationResults = mlContext.BinaryClassification.CrossValidateNonCalibrated(trainingDataView, trainingPipeline, numFolds: 5, labelColumn: "Label");

results in the exception

System.ArgumentOutOfRangeException
  HResult=0x80131502
  Message=Schema mismatch for label column '': expected Bool, got Key<U4>
  Source=Microsoft.ML.Data
  StackTrace:
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.CheckLabelCompatible(Column labelCol)
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.CheckInputSchema(SchemaShape inputSchema)
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.GetOutputSchema(SchemaShape inputSchema)
   at Microsoft.ML.Data.EstimatorChain`1.GetOutputSchema(SchemaShape inputSchema)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.TrainCatalogBase.<>c__DisplayClass7_0.<CrossValidateTrain>b__0(Int32 fold)
   at Microsoft.ML.TrainCatalogBase.CrossValidateTrain(IDataView data, IEstimator`1 estimator, Int32 numFolds, String samplingKeyColumn, Nullable`1 seed)
   at Microsoft.ML.BinaryClassificationCatalog.CrossValidateNonCalibrated(IDataView data, IEstimator`1 estimator, Int32 numFolds, String labelColumn, String samplingKeyColumn, Nullable`1 seed)
   at DogFruitNLP_14KB_735_rows_BinaryClassification.Program.BuildTrainEvaluateAndSaveModel(MLContext mlContext) in C:\AutoMLDotNet\bin\AnyCPU.Debug\mlnet\netcoreapp2.1\DogFruitNLP_14KB_735_rows_BinaryClassification\Program.cs:line 74

Would you have any recommendation for handling these kinds of scenarios?

Ivanidzo4ka commented 5 years ago

For now the plan is as follows: binary classification will support only Boolean labels. If your data contains missing values, load the label as float or text and either filter those rows out or create a mapping from the values to Boolean. Float-to-Boolean conversion should start working after this PR: https://github.com/dotnet/machinelearning/pull/2804

As for text labels, I think the text loader currently reads 'True' and 'False' as Boolean values. For anything else, such as 'Positive', 'Negative', 'Cool', or 'Not cool', you currently need to implement a custom mapping or use ValueMap.
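
A minimal sketch of the value-mapping route described above. It assumes a later ML.NET release (1.x), where the conversion catalog exposes `MapValue` taking key/value pairs; the method name and shape may differ in the version discussed in this thread. The row type `TextLabelRow`, the output column name `LabelBool`, and the label strings 'Positive' and 'Negative' are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type for illustration; not taken from the thread.
public class TextLabelRow
{
    public string Label { get; set; }
    public float Feature { get; set; }
}

public static class MapValueSketch
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Tiny in-memory dataset with a two-valued text label.
        var rows = new[]
        {
            new TextLabelRow { Label = "Positive", Feature = 1.0f },
            new TextLabelRow { Label = "Negative", Feature = 0.0f },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Explicit, user-chosen mapping from text label to Boolean.
        var labelMap = new Dictionary<string, bool>
        {
            ["Positive"] = true,
            ["Negative"] = false,
        };

        // Writes a new Boolean column "LabelBool" computed from the text column "Label".
        var pipeline = mlContext.Transforms.Conversion.MapValue(
            "LabelBool", labelMap, "Label");

        IDataView transformed = pipeline.Fit(data).Transform(data);

        // Read the mapped column back to inspect the result.
        foreach (bool value in transformed.GetColumn<bool>("LabelBool"))
        {
            Console.WriteLine(value);
        }
    }
}
```

A binary trainer would then be pointed at the mapped column, for example through its `labelColumnName` parameter. As far as I understand, label values missing from the map fall back to the output type's default (`false` for Boolean), so unexpected labels are worth filtering out or checking before training.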

daholste commented 5 years ago

Thanks, @Ivanidzo4ka !

> Float to boolean conversion should start work after this PR: #2804

Do you have any plans for key-to-Boolean conversion? This would help on our side.

Ivanidzo4ka commented 5 years ago

That can be quite tricky. We can convert a key back to its original type, but converting it to a specific type feels somewhat weird. A key is basically a dictionary built at runtime. It doesn't make much sense to me to cast a dictionary, which can contain whatever you want, to Boolean.

Why do you need this conversion?

daholste commented 5 years ago

If a dataset has a text label with only 2 values, we want to do something like:

mlContext.Transforms.Conversion.MapValueToKey("Label")
         .Append(mlContext.Transforms.Conversion.ConvertType("Label", outputKind: DataKind.Boolean))
         .Append(mlContext.BinaryClassification.Trainers.LightGbm())

I noticed that

mlContext.Transforms.Conversion.MapValueToKey("Label")
         .Append(mlContext.Transforms.Conversion.ConvertType("Label", outputKind: DataKind.Single))

converts a key type to a float. Is this correct? If so, after your PR (https://github.com/dotnet/machinelearning/pull/2804), could we do something like

mlContext.Transforms.Conversion.MapValueToKey("Label")
         .Append(mlContext.Transforms.Conversion.ConvertType("Label", outputKind: DataKind.Single))
         .Append(mlContext.Transforms.Conversion.ConvertType("Label", outputKind: DataKind.Boolean))
         .Append(mlContext.BinaryClassification.Trainers.LightGbm())

? Does a better way come to mind to transform a text label to a Boolean form (that a binary classification trainer requires)? Thanks for your time!

rogancarr commented 5 years ago

@daholste Perhaps a custom transform would be in order? You can specify the exact mapping you want. This would let you map user-supplied values to booleans. Like @Ivanidzo4ka said, it's not clear a priori what value(s) would map to true or false.

You can define a custom transform like this:

// Define a custom function.
Action<ClassWithKey, ClassWithBool> convertLabelToBoolean = (input, output) =>
{
    output.Label = ConversionLogic(input.Label);
    // Copy the rest over too.
};

// Create a pipeline to execute the custom function.
var pipeline = mlContext.Transforms.CustomMapping(convertLabelToBoolean, null);
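
For reference, a self-contained variation of the snippet above, sketched under a few assumptions: it maps a text label directly instead of a key, it targets ML.NET 1.x, and the row types `RowWithText` and `RowWithBool`, the output column `BoolLabel`, and the rule inside `ConversionLogic` are all hypothetical. Which label value counts as true is the caller's decision, which is the point made earlier in the thread.

```csharp
using System;
using Microsoft.ML;

// Hypothetical input row: text label plus one feature.
public class RowWithText
{
    public string Label { get; set; }
    public float Feature { get; set; }
}

// Hypothetical output row: only the new Boolean column.
public class RowWithBool
{
    public bool BoolLabel { get; set; }
}

public static class CustomMappingSketch
{
    // Assumed rule for illustration: "Positive" means true, anything else false.
    private static bool ConversionLogic(string label) =>
        string.Equals(label, "Positive", StringComparison.OrdinalIgnoreCase);

    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new RowWithText { Label = "Positive", Feature = 1.0f },
            new RowWithText { Label = "Negative", Feature = 0.0f },
        });

        // Define the custom function that fills in the Boolean label.
        Action<RowWithText, RowWithBool> convertLabelToBoolean =
            (input, output) => output.BoolLabel = ConversionLogic(input.Label);

        // A null contract name is accepted, but the resulting transformer cannot be saved.
        var pipeline = mlContext.Transforms.CustomMapping(
            convertLabelToBoolean, contractName: null);

        var transformed = pipeline.Fit(data).Transform(data);

        // Inspect the new column.
        foreach (var row in mlContext.Data.CreateEnumerable<RowWithBool>(
                     transformed, reuseRowObject: false))
        {
            Console.WriteLine(row.BoolLabel);
        }
    }
}
```

The original columns pass through unchanged, so a trainer can use `BoolLabel` as its label column while the text `Label` stays available for inspection.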

frank-dong-ms-zz commented 4 years ago

Closing this issue, as a suggestion has already been given and we have not heard back from the user for more than a year. Feel free to reopen if necessary.