dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Q: Interpreting Feature PFI results #4637

Closed lefig closed 4 years ago

lefig commented 4 years ago

Hi all,

I have been doing a little deep dive into some of my models in order to understand a little more about feature relevance. My results from running feature explanatory analysis for binary classification are as follows:

2020-01-08 11:34:03.813 +00:00 [INF] BinaryFastTreeParameters
2020-01-08 11:34:03.815 +00:00 [INF] Bias: 0
2020-01-08 11:34:03.816 +00:00 [INF] Feature Weights:
2020-01-08 11:34:03.843 +00:00 [INF] Feature: Close Weight: 0.1089412
2020-01-08 11:34:03.931 +00:00 [INF] Feature: Open Weight: 0.3691619
2020-01-08 11:34:03.932 +00:00 [INF] Feature: High Weight: 0.06676193
2020-01-08 11:34:03.933 +00:00 [INF] Feature: Low Weight: 0.1926264
2020-01-08 11:34:03.934 +00:00 [INF] Feature: STO_FastStoch Weight: 0.19846
2020-01-08 11:34:03.938 +00:00 [INF] Feature: STO_StochK Weight: 0.5019926
2020-01-08 11:34:03.941 +00:00 [INF] Feature: STO_StochD Weight: 0.3781931
2020-01-08 11:34:03.942 +00:00 [INF] Feature: STO Weight: 0
2020-01-08 11:34:03.943 +00:00 [INF] Feature: CCI_TypicalPriceAvg Weight: 0.131141
2020-01-08 11:34:03.944 +00:00 [INF] Feature: CCI_TypicalPriceMAD Weight: 0.1299266
2020-01-08 11:34:03.946 +00:00 [INF] Feature: CCI Weight: 1
2020-01-08 11:34:03.947 +00:00 [INF] Feature: RSIDown Weight: 0.4761779
2020-01-08 11:34:03.948 +00:00 [INF] Feature: RSIUp Weight: 0.1249975
2020-01-08 11:34:03.951 +00:00 [INF] Feature: RSI Weight: 0.2877662
2020-01-08 11:34:03.952 +00:00 [INF] Feature: MOM Weight: 0.1822069
2020-01-08 11:34:03.953 +00:00 [INF] Feature: ADX_PositiveDirectionalIndex Weight: 0.2435836
2020-01-08 11:34:03.954 +00:00 [INF] Feature: ADX_NegativeDirectionalIndex Weight: 0.4263106
2020-01-08 11:34:03.955 +00:00 [INF] Feature: ADX Weight: 0.1899773
2020-01-08 11:34:03.956 +00:00 [INF] Feature: CMO Weight: 0.2601428

But for PFI I have the following:

2020-01-08 11:34:09.369 +00:00 [INF] Calculating Binary Classification Feature PFI
2020-01-08 11:34:09.371 +00:00 [INF] Feature PFI for learner:BinaryFastTree
2020-01-08 11:34:09.383 +00:00 [INF] Close| 0.000000
2020-01-08 11:34:09.384 +00:00 [INF] Open| 0.000000
2020-01-08 11:34:09.385 +00:00 [INF] High| 0.000000
2020-01-08 11:34:09.386 +00:00 [INF] Low| 0.000000
2020-01-08 11:34:09.391 +00:00 [INF] STO_FastStoch| 0.000000
2020-01-08 11:34:09.400 +00:00 [INF] STO_StochK| 0.000000
2020-01-08 11:34:09.401 +00:00 [INF] STO_StochD| 0.000000
2020-01-08 11:34:09.402 +00:00 [INF] STO| 0.000000
2020-01-08 11:34:09.404 +00:00 [INF] CCI_TypicalPriceAvg| 0.000000
2020-01-08 11:34:09.406 +00:00 [INF] CCI_TypicalPriceMAD| 0.000113
2020-01-08 11:34:09.408 +00:00 [INF] CCI| 0.000000
2020-01-08 11:34:09.414 +00:00 [INF] RSIDown| 0.000221
2020-01-08 11:34:09.416 +00:00 [INF] RSIUp| 0.000000
2020-01-08 11:34:09.431 +00:00 [INF] RSI| 0.000000
2020-01-08 11:34:09.443 +00:00 [INF] MOM| -0.003003
2020-01-08 11:34:09.457 +00:00 [INF] ADX_PositiveDirectionalIndex| 0.000000
2020-01-08 11:34:09.467 +00:00 [INF] ADX_NegativeDirectionalIndex| 0.000000
2020-01-08 11:34:09.470 +00:00 [INF] ADX| 0.000000
2020-01-08 11:34:09.479 +00:00 [INF] CMO| 0.000000

My question is essentially: what should I read (if anything) into zero values for PFI? The evaluation score too:

2020-01-08 11:34:17.135 +00:00 [INF] Score: -4.640871
2020-01-08 11:34:17.138 +00:00 [INF] Probability: 0.1351293

I would appreciate any thoughts that you may have regarding using such info to improve model veracity.

Thank you Fig

antoniovs1029 commented 4 years ago

Can you please share the code you used to print those values to check a couple of things?

lefig commented 4 years ago

Pleasure and thank you for your help!

The logging functions:

private void LogModelWeights(LinearBinaryModelParameters subModel, string name)
        {
            // 'features' and 'contributions' are class-level fields populated elsewhere.
            var weights = subModel.Weights.ToList();

            // Log the model parameters.
            Logger.Info(name + "Parameters");
            Logger.Info("Bias: " + subModel.Bias);
            Logger.Info("Feature Weights:");

            // 1. Feature weights
            for (int i = 0; i < features.Length; i++)
            {
                contributions[i].Weight = weights[i];
                contributions[i].Contribution = 0;  // The contribution will be assigned by the prediction
                                                    // engine using CalculateFeatureContribution (below).
                Logger.Info(" Feature: " + contributions[i].Name + " Weight: " + contributions[i].Weight);
            }
        }

private void LogPermutationMetics(IDataView transformedData, 
            ImmutableArray<BinaryClassificationMetricsStatistics> permutationMetrics)
        {
            var allFeatureNames = GetColumnNamesUsedForPFI(transformedData);
            var mapFields = new List<string>();
            for (int i = 0; i < allFeatureNames.Count(); i++)
            {
                var slotField = new VBuffer<ReadOnlyMemory<char>>();
                if (transformedData.Schema[allFeatureNames[i]].HasSlotNames())
                {
                    transformedData.Schema[allFeatureNames[i]].GetSlotNames(ref slotField);
                    for (int j = 0; j < slotField.Length; j++)
                    {
                        mapFields.Add(allFeatureNames[i]);
                    }
                }
                else
                {
                    mapFields.Add(allFeatureNames[i]);
                }
            }

            // Now let's look at which features are most important to the model
            // overall. Get the feature indices sorted by their impact on AUC.
            // The importance, i.e. the absolute average change in the AreaUnderRocCurve
            // metric calculated by PermutationFeatureImportance, can then be ordered
            // from most important to least important.
            var sortedIndices = permutationMetrics
                .Select((metrics, index) => new { index, metrics.AreaUnderRocCurve })
                .OrderByDescending(
                feature => Math.Abs(feature.AreaUnderRocCurve.Mean));

            Console.WriteLine("Feature indices sorted by their impact on AUC:");

            foreach (var feature in sortedIndices)
            {
                Console.WriteLine($"{mapFields[feature.index],-20}|\t{Math.Abs(feature.AreaUnderRocCurve.Mean):F6}");
            }

            Console.WriteLine("PFI AUC logged as the following:");
            // Combine metrics with feature names and format for display
            for (int i = 0; i < permutationMetrics.Length; i++)
            {
                Logger.Info($"{importances[i].Name}|\t{permutationMetrics[i].AreaUnderRocCurve.Mean:F6}");
                importances[i].AUC = permutationMetrics[i].AreaUnderRocCurve.Mean;
            }
        }

najeeb-kazmi commented 4 years ago

Hi @lefig - can you share the code that generates the objects passed to these logging functions?

- LinearBinaryModelParameters subModel
- IDataView transformedData
- ImmutableArray<BinaryClassificationMetricsStatistics> permutationMetrics

Please also share code for any data processing and model training.

PFI values for features being 0 mean that permuting the feature values did not change AreaUnderRocCurve much. This is not the same as the weight learned by the model being 0. You can have non-zero weights for a feature that are not statistically significant, and you could end up with a situation where PFI metrics are 0.

Note that the PFI value is just one indicator of feature importance, not a conclusive statement of it. That said, so many features having a PFI of 0 warrants some further investigation. Here are a few reasons I can think of that could possibly explain this.
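To make the distinction between "zero weight" and "zero PFI" concrete, here is a minimal, self-contained sketch of the idea behind permutation feature importance, written in plain C# without ML.NET. The toy model, score function, and data are all invented for illustration: a feature the model ignores can be shuffled without moving the score, so its PFI comes out exactly zero even though a learner might still assign it a non-zero weight.

```csharp
// Conceptual sketch of permutation feature importance (PFI), independent of ML.NET.
// For each feature column, the values are shuffled across rows and the drop in a
// score is measured; a near-zero drop means the model barely used that feature.
using System;
using System.Linq;

class PfiSketch
{
    // A toy "model" that only uses feature 0 and completely ignores feature 1.
    static double Predict(double[] row) => row[0];

    // A toy score: negative mean absolute error against the labels (higher is better).
    static double Score(double[][] rows, double[] labels) =>
        -rows.Select((r, i) => Math.Abs(Predict(r) - labels[i])).Average();

    static void Main()
    {
        var rng = new Random(42);
        var rows = Enumerable.Range(0, 100)
            .Select(_ => new[] { rng.NextDouble(), rng.NextDouble() })
            .ToArray();
        var labels = rows.Select(r => r[0]).ToArray(); // label depends only on feature 0

        double baseline = Score(rows, labels);
        for (int f = 0; f < 2; f++)
        {
            // Copy the data and shuffle one column.
            var permuted = rows.Select(r => (double[])r.Clone()).ToArray();
            var shuffled = permuted.Select(r => r[f]).OrderBy(_ => rng.Next()).ToArray();
            for (int i = 0; i < permuted.Length; i++) permuted[i][f] = shuffled[i];

            double drop = baseline - Score(permuted, labels);
            Console.WriteLine($"Feature {f}: score drop {drop:F6}");
        }
        // Feature 0 shows a clearly positive drop; feature 1 shows exactly 0,
        // because the model never reads it.
    }
}
```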

lefig commented 4 years ago

Hi @najeeb-kazmi

Thank you for your kind help. The code that generates the metrics is as follows (this is an example of one such learner that requires a calibrator).

private void CalculateGamCalibratedClassificationPermutationFeatureImportance(MLContext mlContext, IDataView transformedData,
                                                        ITransformer trainedModel, string learner)
        {
            // Extract the trainer (the last transformer in the model).
            // Note: the 'as' cast yields null if trainedModel is not this exact transformer type.
            var singleTrainerModel = trainedModel as BinaryPredictionTransformer<CalibratedModelParametersBase<GamBinaryModelParameters,
                PlattCalibrator>>;

            // Calculate permutation feature importance.
            ImmutableArray<BinaryClassificationMetricsStatistics> permutationMetrics =
                mlContext.BinaryClassification.PermutationFeatureImportance(
                    predictionTransformer: singleTrainerModel,
                    data: transformedData,
                    labelColumnName: "Label",
                    numberOfExamplesToUse: 100,
                    permutationCount: 50);

            Logger.Info("Calculating Binary Classification Feature PFI");
            Logger.Info("Feature PFI for learner:" + learner);
            LogPermutationMetics(transformedData, permutationMetrics);
        }
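One thing that may be worth logging here, given the point above about statistical significance: each PFI entry is a BinaryClassificationMetricsStatistics, and its MetricStatistics members (such as AreaUnderRocCurve) expose StandardError and StandardDeviation across the permutationCount runs in addition to Mean. A sketch of a helper for this (the method name is mine, and it assumes the same Logger field used elsewhere in this thread):

```csharp
// Hypothetical helper: logs the mean change in AUC for each permuted feature
// together with its standard error, using the statistics ML.NET already computes.
// A 0.000000 mean with a tiny standard error really is zero; a 0 mean with a
// large spread across permutations is just noise.
private void LogPermutationUncertainty(
    ImmutableArray<BinaryClassificationMetricsStatistics> permutationMetrics)
{
    for (int i = 0; i < permutationMetrics.Length; i++)
    {
        var auc = permutationMetrics[i].AreaUnderRocCurve;
        Logger.Info($"Feature {i}: dAUC mean {auc.Mean:F6}, " +
                    $"std err {auc.StandardError:F6}, std dev {auc.StandardDeviation:F6}");
    }
}
```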

I tend to think (your point 2) that the model is poor and needs some features removed. Hence I was hoping to have some insight regarding the names of those features so that I can proceed with changing the model.

Best wishes Fig

najeeb-kazmi commented 4 years ago

@lefig

najeeb-kazmi commented 4 years ago

@lefig any update on this and the information I requested? Also, did any of my suggestions help in debugging this?

I'm curious to see why this is happening, as it is quite unusual. As I mentioned, it's not clear which model is giving you 0 PFI, the GAM or the linear one. It would be nice to see a reproducible example (a small snippet of the data and the actual code for training the model and calculating PFI) so I can debug this.

lefig commented 4 years ago

Hi @najeeb-kazmi

I really appreciate your time and help with this. Please let me generate some further test data and I will get back to you.

najeeb-kazmi commented 4 years ago

@lefig if this is still an issue, please feel free to reopen.