dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.
Creative Commons Attribution 4.0 International

Add method to generated code for model explainability (PFI). #2375

Open luisquintanilla opened 1 year ago

luisquintanilla commented 1 year ago

Related to #1031

beccamc commented 1 year ago

Working on the code to be generated for the February release.

beccamc commented 1 year ago

Here is the code for the regression scenario. The only metric that needs to change each time the code is generated is the "RSquared" value in the permutationFeatureImportance.Select statement. The three options are...

        /// <summary>
        /// Permutation feature importance (PFI) is a technique to determine the importance 
        /// of features in a trained machine learning model. PFI works by taking a labeled dataset, 
        /// choosing a feature, and permuting the values for that feature across all the examples, 
        /// so that each example now has a random value for the feature and the original values for all other features.
        /// The evaluation metric (e.g. R-squared) is then calculated for this modified dataset, 
        /// and the change in the evaluation metric from the original dataset is computed. 
        /// The larger the change in the evaluation metric, the more important the feature is to the model.
        /// 
        /// PFI typically takes a long time to compute, as the evaluation metric is calculated 
        /// many times to determine the importance of each feature. 
        /// 
        /// </summary>
        /// <param name="mlContext">The common context for all ML.NET operations.</param>
        /// <param name="trainData">IDataView used to evaluate the model.</param>
        /// <param name="model">Model to evaluate.</param>
        /// <param name="labelColumnName">Label column being predicted.</param>
        /// <returns>A list of each feature and its importance.</returns>
        public static List<Tuple<string, double>> CalculatePFI(MLContext mlContext, IDataView trainData, ITransformer model, string labelColumnName)
        {
            var preprocessedTrainData = model.Transform(trainData);

            var permutationFeatureImportance =
                mlContext.Regression
                .PermutationFeatureImportance(
                            model,
                            preprocessedTrainData,
                            labelColumnName: labelColumnName);

            var featureImportanceMetrics =
                 permutationFeatureImportance
                 .Select((kvp) => new { kvp.Key, kvp.Value.RSquared })
                 .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

            var featurePFI = new List<Tuple<string, double>>();
            foreach (var feature in featureImportanceMetrics)
            {
                var pfiValue = Math.Abs(feature.RSquared.Mean);
                featurePFI.Add(new Tuple<string, double>(feature.Key, pfiValue));
            }

            return featurePFI;
        }
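For example, if the user chose Mean Absolute Error in advanced training, only the metric references would change. A minimal sketch (MeanAbsoluteError is another property on the RegressionMetricsStatistics values that PFI returns):

            // Hypothetical variant: the user selected Mean Absolute Error instead of R-squared.
            var featureImportanceMetrics =
                 permutationFeatureImportance
                 .Select((kvp) => new { kvp.Key, kvp.Value.MeanAbsoluteError })
                 .OrderByDescending(myFeatures => Math.Abs(myFeatures.MeanAbsoluteError.Mean));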
luisquintanilla commented 1 year ago

Thanks @beccamc. How is the metric set in the generated code?

beccamc commented 1 year ago

The metric calculated in PFI should be the metric the user chose in the advanced training options.

luisquintanilla commented 1 year ago

@beccamc that's what I thought. Thanks for confirming.

beccamc commented 1 year ago

Here is the code for multiclass. Similarly, "MicroAccuracy" should be whichever metric was used for training. The PFI is still running for me, but this compiles.

        public static List<Tuple<string, double>> CalculatePFI(MLContext mlContext, IDataView trainData, ITransformer model, string labelColumnName)
        {
            var preprocessedTrainData = model.Transform(trainData);

            var permutationFeatureImportance =
                mlContext.MulticlassClassification
                .PermutationFeatureImportance(
                            model,
                            preprocessedTrainData,
                            labelColumnName: labelColumnName,
                            permutationCount: 3);

            var featureImportanceMetrics =
                 permutationFeatureImportance
                 .Select((kvp) => new { kvp.Key, kvp.Value.MicroAccuracy })
                 .OrderByDescending(myFeatures => Math.Abs(myFeatures.MicroAccuracy.Mean));

            var featurePFI = new List<Tuple<string, double>>();
            foreach (var feature in featureImportanceMetrics)
            {
                var pfiValue = Math.Abs(feature.MicroAccuracy.Mean);
                featurePFI.Add(new Tuple<string, double>(feature.Key, pfiValue));
            }

            return featurePFI;
        }
beccamc commented 1 year ago

Note these are the same. We just need to update the training type (scenario) and the metric.

        var preprocessedTrainData = model.Transform(trainData);

        var permutationFeatureImportance =
            mlContext.<SCENARIO>
            .PermutationFeatureImportance(
                        model,
                        preprocessedTrainData,
                        labelColumnName: labelColumnName);

        var featureImportanceMetrics =
             permutationFeatureImportance
             .Select((kvp) => new { kvp.Key, kvp.Value.<METRIC> })
             .OrderByDescending(myFeatures => Math.Abs(myFeatures.<METRIC>.Mean));

        var featurePFI = new List<Tuple<string, double>>();
        foreach (var feature in featureImportanceMetrics)
        {
            var pfiValue = Math.Abs(feature.<METRIC>.Mean);
            featurePFI.Add(new Tuple<string, double>(feature.Key, pfiValue));
        }

        return featurePFI;
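For reference, a usage sketch of the generated method. ModelInput, "data.csv", and "model.zip" are hypothetical placeholders here, not names the tool actually emits:

        // Hypothetical caller: load data and a trained model, then rank features by PFI.
        var mlContext = new MLContext();
        IDataView trainData = mlContext.Data.LoadFromTextFile<ModelInput>("data.csv", hasHeader: true, separatorChar: ',');
        ITransformer model = mlContext.Model.Load("model.zip", out _);

        var featurePFI = CalculatePFI(mlContext, trainData, model, labelColumnName: "Label");
        foreach (var (feature, importance) in featurePFI)
        {
            Console.WriteLine($"{feature,-30} {importance:F4}");
        }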
LanceElCamino commented 1 year ago

Thank you. It appears that BinaryClassificationCatalog doesn't offer a PermutationFeatureImportance overload that takes the model directly; the model's .LastTransformer has to be cast for it to work. When cast this way, mlContext.BinaryClassification.PermutationFeatureImportance returns an array of the feature contributions to the score, whereas the Regression and Multiclass PFI methods return a dictionary whose Key is the feature and whose Value is the contribution.

        static List<Tuple<string, double>> CalculatePFI(MLContext mlContext, IDataView trainData, ITransformer bestModel, string labelColumnName)
        {
            var preprocessedTrainData = bestModel.Transform(trainData);

            var linearPredictor = (bestModel as TransformerChain<ITransformer>).LastTransformer as ISingleFeaturePredictionTransformer<object>;

            var permutationFeatureImportance =
                mlContext.BinaryClassification.PermutationFeatureImportance(linearPredictor, trainData, permutationCount: 3);

How can we pair the features with their contributions when using a Binary Classification scenario?
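One possible approach, sketched under the assumption that the transformed data's "Features" column carries slot names (the usual case after a Concatenate step), with AreaUnderRocCurve standing in for whichever metric the user trained with: read the slot names from the schema and pair them with the returned array by index.

        // Minimal sketch: pair the PFI result array with feature slot names by index.
        // Requires using Microsoft.ML.Data for VBuffer.
        VBuffer<ReadOnlyMemory<char>> slotNames = default;
        preprocessedTrainData.Schema["Features"].GetSlotNames(ref slotNames);
        var featureNames = slotNames.DenseValues().Select(name => name.ToString()).ToArray();

        var featurePFI = permutationFeatureImportance
            .Select((metrics, index) => new Tuple<string, double>(
                featureNames[index],
                Math.Abs(metrics.AreaUnderRocCurve.Mean)))
            .OrderByDescending(feature => feature.Item2)
            .ToList();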

beccamc commented 1 year ago

@LanceElCamino Does using PermutationFeatureImportanceNonCalibrated work?

var permutationFeatureImportance =
    mlContext.BinaryClassification.PermutationFeatureImportanceNonCalibrated(model, trainData, "Label")
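If it does, the NonCalibrated overload returns the same Key/Value dictionary shape as the Regression and Multiclass versions, so features pair up directly. A minimal sketch, again with AreaUnderRocCurve as a stand-in for the chosen metric:

        // Hedged sketch: the dictionary keys are the feature names, as in the other scenarios.
        var featurePFI = permutationFeatureImportance
            .Select(kvp => new Tuple<string, double>(kvp.Key, Math.Abs(kvp.Value.AreaUnderRocCurve.Mean)))
            .OrderByDescending(feature => feature.Item2)
            .ToList();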
beccamc commented 1 year ago

Action items...

  1. For Regression and Data Classification, add a new xxx.evaluate.cs file to the generated files.
  2. Add a PFI method similar to the following...

Below is the code for Regression. Two things are variable in the code, the scenario and the metric.

using Microsoft.ML;
using Microsoft.ML.Data;
using System.Collections.Immutable;

namespace Regression_ConsoleExample
{
    public partial class Regression
    {
        /// <summary>
        /// Permutation feature importance (PFI) is a technique to determine the importance 
        /// of features in a trained machine learning model. PFI works by taking a labeled dataset, 
        /// choosing a feature, and permuting the values for that feature across all the examples, 
        /// so that each example now has a random value for the feature and the original values for all other features.
        /// The evaluation metric (e.g. R-squared) is then calculated for this modified dataset, 
        /// and the change in the evaluation metric from the original dataset is computed. 
        /// The larger the change in the evaluation metric, the more important the feature is to the model.
        /// 
        /// PFI typically takes a long time to compute, as the evaluation metric is calculated 
        /// many times to determine the importance of each feature. 
        /// 
        /// </summary>
        /// <param name="mlContext">The common context for all ML.NET operations.</param>
        /// <param name="trainData">IDataView used to evaluate the model.</param>
        /// <param name="model">Model to evaluate.</param>
        /// <param name="labelColumnName">Label column being predicted.</param>
        /// <returns>A list of each feature and its importance.</returns>
        public static List<Tuple<string, double>> CalculatePFI(MLContext mlContext, IDataView trainData, ITransformer model, string labelColumnName)
        {
            var preprocessedTrainData = model.Transform(trainData);

            var permutationFeatureImportance =
                mlContext.Regression
                .PermutationFeatureImportance(
                            model,
                            preprocessedTrainData,
                            labelColumnName: labelColumnName);

            var featureImportanceMetrics =
                 permutationFeatureImportance
                 .Select((kvp) => new { kvp.Key, kvp.Value.RSquared })
                 .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

            var featurePFI = new List<Tuple<string, double>>();
            foreach (var feature in featureImportanceMetrics)
            {
                var pfiValue = Math.Abs(feature.RSquared.Mean);
                featurePFI.Add(new Tuple<string, double>(feature.Key, pfiValue));
            }

            return featurePFI;
        }
    }
}
zewditu commented 1 year ago

@beccamc are we able to go with PermutationFeatureImportanceNonCalibrated?

beccamc commented 1 year ago

Yes! For BinaryClassification you should be able to use PermutationFeatureImportanceNonCalibrated. If I remember correctly, PermutationFeatureImportance works for Regression.

LanceElCamino commented 1 year ago

Thank you. How can we use this additional evaluate.cs code to see the PFI metrics in Model Builder? I'm by no means a developer; I can hack my way into running PFI in a console app for Binary Classification models, but I have to run a new experiment within the same app to do so. In simple terms, how do I use the evaluate.cs code generated from my Model Builder experiments to see the PFI metrics?

pjsgsy commented 4 months ago

I know a long time has passed, but just to add my vote: it would be incredibly useful and time-saving to be able to simply view the PFI results as part of the Model Builder process. It would greatly speed up development and feature selection.