dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

How can we show the bins associated with a numeric feature so the end user can easily identify them? #4912

Closed nighotatul closed 4 years ago

nighotatul commented 4 years ago

@yaeldekel, @eerhardt, @najeeb-kazmi, @justinormont, @CESARDELATORRE

Thanks for the example showing the weight of each feature.

This follows up on the earlier issue "how we show permutation slot associated with feature so end user easily identify?": https://github.com/dotnet/machinelearning/issues/4739

[Image: Picture1]

1) Suppose the YearsAtCompany feature holds numeric data. How can we get its bins with their model weights? And if the data is continuous, how can we get scatter points?

2) How can we also get the score, probability, and confusion matrix for PFI?

najeeb-kazmi commented 4 years ago

@nighotatul

  1. Getting feature importance for bins of YearsAtCompany:

This depends on whether you encoded this feature as a categorical feature when training. For example, you might have applied a custom mapping transform to generate a new column with categorical values such as LessThanOne, OneToFive, and FiveToTen. If you then use a categorical transform to featurize this column, PFI should give you a distinct feature importance for each category.

This, however, is likely to give a poorer model than one where you use YearsAtCompany as a numerical variable, as you lose information by binning the feature. I would not recommend this from a model quality perspective.
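To make the binning idea concrete, here is a minimal, language-agnostic sketch in plain Python. It does not use any ML.NET API. The category names LessThanOne, OneToFive, and FiveToTen come from the reply above; the bin edges, the `TenPlus` overflow category, and the helper names `bin_years` and `one_hot` are assumptions made only for illustration.

```python
# Hypothetical bin edges (upper bound, name); the edges are assumptions.
BINS = [(1, "LessThanOne"), (5, "OneToFive"), (10, "FiveToTen")]
OVERFLOW = "TenPlus"  # hypothetical category for values of 10 or more
CATEGORIES = [name for _, name in BINS] + [OVERFLOW]


def bin_years(years):
    """Return the name of the first bin whose upper edge exceeds the value."""
    for upper, name in BINS:
        if years < upper:
            return name
    return OVERFLOW


def one_hot(category):
    """Encode a category as a 0/1 vector with one slot per known category."""
    return [1 if category == c else 0 for c in CATEGORIES]


years_at_company = [0.5, 3, 7, 12, 4]
binned = [bin_years(y) for y in years_at_company]
encoded = [one_hot(b) for b in binned]
```

After this step each category occupies its own slot, so an importance measure computed per slot can distinguish the bins. It also shows the information loss mentioned above: the values 3 and 4 both become OneToFive and can no longer be told apart.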

  2. Score, Probability, and Confusion Matrix:

Score and probability are specific to each row of the data, not model wide metrics. As such, they are not relevant to PFI, which is model wide feature importance. Contribution of each feature to the predicted score of each row of data can be calculated with the MLContext.Transforms.CalculateFeatureContributions API (example).

As far as the confusion matrix is concerned, it is not relevant to PFI either, since PFI measures importance by how much a metric changes when a feature is randomly permuted. A confusion matrix is not a metric, so there is no change to calculate by permuting features.
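The permutation idea described above can be sketched in a few lines of plain Python. This is an illustration of the general technique, not ML.NET's implementation: the toy `predict` function, the sample rows, the choice of accuracy as the metric, and the sign convention (baseline minus permuted) are all assumptions.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction equals the label."""
    hits = sum(1 for r, y in zip(rows, labels) if predict(r) == y)
    return hits / len(rows)


def permutation_importance(predict, rows, labels, feature,
                           metric=accuracy, seed=0):
    """Baseline metric minus the metric after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = metric(predict, rows, labels)
    column = [r[feature] for r in rows]
    rng.shuffle(column)  # break the link between this feature and the label
    permuted = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
    return baseline - metric(predict, permuted, labels)


# Toy stand-in for a trained model: it looks only at YearsAtCompany.
def predict(row):
    return row["YearsAtCompany"] < 2


rows = [{"YearsAtCompany": y, "Age": a}
        for y, a in [(0.5, 30), (1, 41), (3, 25), (7, 52), (12, 38), (1.5, 29)]]
labels = [predict(r) for r in rows]  # labels the toy model fits perfectly

importance_years = permutation_importance(predict, rows, labels, "YearsAtCompany")
importance_age = permutation_importance(predict, rows, labels, "Age")
```

The result is one scalar per feature, computed from a single scalar metric over the whole dataset. That is why per-row quantities such as score and probability, or a non-scalar summary such as a confusion matrix, do not fit into this calculation.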

In summary, score, probability, and confusion matrix are conceptually unrelated to PFI.

nighotatul commented 4 years ago

Please consider the question from the perspective of our application. In our application, the user (a data scientist) selects the columns themselves, placing them in a label area and a feature area. When we calculate PFI, we show the model statistics alongside the PFI graph. When the user selects a string-type feature on the PFI graph, we show a distribution graph: for example, if the user selects "marital status", we show the unique values of marital status with their respective weights. If the user selects a numeric feature such as "Years at company", it should likewise display the distribution of that feature: bins with their respective weights if the numeric feature is categorical, or a scatter plot if it is continuous. Any example or guidance that helps us achieve this would be very helpful, so that the user can better understand the impact analysis.

najeeb-kazmi commented 4 years ago

What you are asking for is not related to PFI at all. You can't get PFI for individual values or bins of values of a numeric feature - it is just not possible. PFI works by permuting all the values of a feature and calculating how a metric changes because that feature was permuted. If you want to treat different bins of YearsAtCompany separately, you will have to encode it as a categorical feature yourself; then you will get PFI for individual bins, just like you do with MaritalStatus.

As far as a scatterplot is concerned, it is just not possible with PFI, as you don't get PFI for individual points, but for the entire feature itself. What you are looking for here is CalculateFeatureContributions. This will give you the relative importance of YearsAtCompany for each individual row of data, which you can visualize however you want.
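As a rough illustration of what per-row contributions could look like, the plain-Python sketch below uses a linear model, where one common definition of a feature's contribution is its weight times its value in that row. The weights, bias, and sample rows are hypothetical, and this is not a reproduction of what CalculateFeatureContributions computes for any particular trainer.

```python
# Hypothetical weights and bias of an already-trained linear model.
WEIGHTS = {"YearsAtCompany": -0.4, "Age": 0.05}
BIAS = 1.0


def score(row):
    """Linear model score: bias plus the weighted sum of the features."""
    return BIAS + sum(WEIGHTS[name] * row[name] for name in WEIGHTS)


def contributions(row):
    """Per-row contribution of each feature, defined here as weight * value."""
    return {name: WEIGHTS[name] * row[name] for name in WEIGHTS}


rows = [
    {"YearsAtCompany": 0.5, "Age": 30},
    {"YearsAtCompany": 3, "Age": 25},
    {"YearsAtCompany": 12, "Age": 38},
]

# One (feature value, contribution) pair per row: the input for a scatter plot.
scatter_points = [
    (r["YearsAtCompany"], contributions(r)["YearsAtCompany"]) for r in rows
]
```

Unlike PFI, this yields one point per row, so plotting `scatter_points` gives a scatter plot of YearsAtCompany against its contribution to each prediction.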

This is not something ML.NET will support.