dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.
Creative Commons Attribution 4.0 International

Strange error about Probability #892

Closed CBrauer closed 3 years ago

CBrauer commented 4 years ago

System information

Microsoft Visual Studio Professional 2019, Version 16.6.3
Windows 10 Enterprise, Build 19041.329
ML.NET 1.5.0

Issue

I generated a console app with mlnet cli. I then modified the source code to use Cross Validation. I'm getting an error during execution. A screen capture of my running application is:

[Screenshot of the running application]

I don't understand the error: "Probability column 'Probability' not found (Parameter 'schema')".

Source code

I have put the complete solution on https://github.com/CBrauer/Test_CrossValidate.

Using Visual Studio, I searched for the word "Probability". It did not exist.

My ModelOutput.cs is:

using System;
using Microsoft.ML.Data;

namespace Version3.Model {
  public class ModelOutput {
    [ColumnName("PredictedLabel")]
    public string Prediction { get; set; }
    public float[] Score { get; set; }
  }
}

By the way, you guys changed Score to be a float array. Please explain how we should interpret this array.

I put the complete solution at: https://github.com/CBrauer/Test_CrossValidate

Any help will be greatly appreciated.

Charles

frank-dong-ms-zz commented 4 years ago

@CBrauer Thanks for using ML.NET. I checked your code. Based on the trainer definition below, you should use MulticlassClassification.CrossValidate instead of BinaryClassification.CrossValidate at lines 34-35 of ModelBuilder.cs when doing cross-validation:

var trainer = mlContext.MulticlassClassification.Trainers
    .OneVersusAll(
        mlContext.BinaryClassification.Trainers.FastTree(
            new FastTreeBinaryTrainer.Options()
            {
                NumberOfLeaves = 123,
                MinimumExampleCountPerLeaf = 1,
                NumberOfTrees = 500,
                LearningRate = 0.13990435f,
                Shrinkage = 2.5920498f,
                LabelColumnName = "Altitude",
                FeatureColumnName = "Features"
            }),
        labelColumnName: "Altitude")
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));

The error message you mentioned, "Probability column 'Probability' not found (Parameter 'schema')", appears because you are cross-validating a MulticlassClassification model as if it were a BinaryClassification model. The Probability column is generated automatically and appended to the output schema during model fitting for BinaryClassification models only. If you change your trainer to something like the one below, you will see the Probability column generated after the model is fit:

var trainer = mlContext.BinaryClassification.Trainers.FastTree(
    new FastTreeBinaryTrainer.Options()
    {
        NumberOfLeaves = 123,
        MinimumExampleCountPerLeaf = 1,
        NumberOfTrees = 500,
        LearningRate = 0.13990435f,
        Shrinkage = 2.5920498f,
        LabelColumnName = "Altitude",
        FeatureColumnName = "Features"
    });
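For the one-versus-all trainer shown earlier, cross-validation would go through the multiclass catalog instead. The following is only a minimal sketch: the class name, the method name, and the parameters `trainingDataView` and `trainingPipeline` are hypothetical and are not taken from the repository in the issue. It assumes the pipeline has already converted the "Altitude" label to a key type, as CLI-generated pipelines normally do.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

public static class CrossValidateSketch
{
    // Hypothetical helper; the names here are illustrative assumptions.
    public static void Evaluate(
        MLContext mlContext,
        IDataView trainingDataView,
        IEstimator<ITransformer> trainingPipeline)
    {
        // The multiclass evaluator reads the Score vector and PredictedLabel,
        // so it does not look for a Probability column.
        var results = mlContext.MulticlassClassification.CrossValidate(
            trainingDataView,
            trainingPipeline,
            numberOfFolds: 5,
            labelColumnName: "Altitude");

        // Average the per-fold metrics.
        double microAccuracy = results.Average(r => r.Metrics.MicroAccuracy);
        double macroAccuracy = results.Average(r => r.Metrics.MacroAccuracy);

        Console.WriteLine($"Average MicroAccuracy: {microAccuracy:F4}");
        Console.WriteLine($"Average MacroAccuracy: {macroAccuracy:F4}");
    }
}
```

Each element of `results` also exposes the fitted model for that fold, which can help when one fold behaves very differently from the others.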

Regarding the question quoted below, did you mean the Score property in the ModelOutput class, which is auto-generated? If so, I would like to let the Model Builder team help answer this. @LittleLittleCloud @JakeRadMSFT

By the way, you guys changed Score to be a float array. Please explain how we should interpret this array.

Please let me know if anything is unclear or you have any further questions, thanks.

CBrauer commented 4 years ago

Thanks for your comments and help.

frank-dong-ms-zz commented 4 years ago

@CBrauer regarding the comment quoted below, could you please provide more details? For example, what was the original type of Score, and when did this change happen: after upgrading the ML.NET version, after upgrading the Model Builder version, or after a code change?

By the way, you guys changed Score to be a float array. Please explain how we should interpret this array.

CBrauer commented 4 years ago

Yes, it happened after the upgrade to 1.5.1. The code was generated by this CLI script:


mlnet classification^
 --dataset "H:\HedgeTools\Datasets\rocket-train-classify.csv"^
 --validation-dataset "H:\HedgeTools\Datasets\rocket-test-classify.csv"^
 --test-dataset "H:\HedgeTools\Datasets\rocket-test-classify.csv"^
 --label-col "Altitude"^
 --cache on^
 --has-header true^
 --train-time 3600

The generated code is:


//*****************************************************************************************
//*                                                                                       *
//* This is an auto-generated file by Microsoft ML.NET CLI (Command-Line Interface) tool. *
//*                                                                                       *
//*****************************************************************************************

using System;

using Microsoft.ML.Data;

namespace Version3.Model {
  public class ModelOutput {
    // ColumnName attribute is used to change the column name from
    // its default value, which is the name of the field.
    [ColumnName("PredictedLabel")]
    public String Prediction { get; set; }
    public float[] Score { get; set; }
  }
}

I assume that the Score array contains the probabilities of the prediction. Please look at the code on my GitHub account (see above).

Charles

JakeRadMSFT commented 4 years ago

@CBrauer the Score array is the probabilities of the different labels. This is definitely something that needs improvement but you can find some information here: https://github.com/dotnet/docs/issues/14265.

Unfortunately, there isn't a great way to get the list of labels from ML.NET. We have some code we use in Model Builder that we might be able to clean up and share, or maybe get added to ML.NET.
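To illustrate how such an array might be read once the label order is known, here is a minimal sketch in plain C#. It assumes that `labels[i]` corresponds to `scores[i]`; obtaining that ordering is exactly the part described above as awkward, so the label list is a hypothetical input here. The class and method names are also illustrative.

```csharp
using System;

public static class ScoreInterpretation
{
    // Returns the label whose score is highest, together with that score.
    // Assumption: labels[i] is the class that scores[i] refers to.
    public static (string Label, float Score) TopPrediction(string[] labels, float[] scores)
    {
        if (labels == null || scores == null)
            throw new ArgumentNullException(labels == null ? nameof(labels) : nameof(scores));
        if (labels.Length != scores.Length)
            throw new ArgumentException("labels and scores must have the same length.");
        if (scores.Length == 0)
            throw new ArgumentException("scores must not be empty.");

        // Linear scan for the index of the largest score.
        int best = 0;
        for (int i = 1; i < scores.Length; i++)
        {
            if (scores[i] > scores[best])
            {
                best = i;
            }
        }

        return (labels[best], scores[best]);
    }
}
```

With made-up labels `{ "Low", "Medium", "High" }` and scores `{ 0.10f, 0.75f, 0.15f }`, `TopPrediction` would pick "Medium", which should agree with the model's PredictedLabel when the label order is correct.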

CBrauer commented 4 years ago

Thank you for your reply.