Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

raw feature importance values drop to zero unexpectedly #809

Closed BillmanH closed 4 years ago

BillmanH commented 4 years ago

using: azureml-sdk[automl,explain,notebooks]==1.0.85

Repro steps: when running a regression model explanation, the low values in raw_explanation.get_feature_importance_dict() drop off:

{'a': 0.1231724761533697,
 'b': 0.11367146207316817,
 'c': 0.11131743155810468,
...
 'd': 0.004097432295042588,
 'e': 0.0031938698542980223,
 'f': 0.0019239375660602262,
 'g': 0.0,
 'h': 0.0,
 'i': 0.0}

I plotted the SHAP values here to show that they drop off at a very low point. [image: plot of the SHAP values]

The values are higher than zero because ... I know this. However, it seems to take everything below a certain point and drop it to 0.0, even though the next highest value is 0.002, per the above. I can't find any documentation about the truncation.

This also makes it look like there is a sudden drop in explainability between one feature and the next (albeit a minor one).
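
For context, the raw explanation above was retrieved roughly like this (a sketch only; best_run is a placeholder for the AutoML child run that produced the model, and the exact import path differs across SDK versions, azureml.contrib.interpret in older releases and azureml.interpret in newer ones):

# Sketch: download the raw (pre-featurization) explanation from the AutoML run.
from azureml.interpret import ExplanationClient

client = ExplanationClient.from_run(best_run)
raw_explanation = client.download_model_explanation(raw=True)
print(raw_explanation.get_feature_importance_dict())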


dataders commented 4 years ago

@wamartin-aml @imatiach-msft

imatiach-msft commented 4 years ago

@BillmanH sorry about the trouble you are having. To clarify, are you using the mimic explainer or the linear explainer? If using a regression model and using the linear explainer, I wouldn't expect to see this. However, the mimic explainer (the default one used by AutoML) trains a global surrogate model for the output of the original or teacher model, so this could be an artifact of the regularization parameters in the surrogate LightGBM model.
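
To make that concrete, here is a rough sketch (not the exact AutoML code path; fitted_model, X_train and X_test are placeholders) of building the mimic explainer directly through interpret-community and loosening the LightGBM surrogate's regularization, which is one way to check whether the zeroed importances are a regularization artifact:

# Sketch only: explicit MimicExplainer with a LightGBM surrogate and relaxed L1/L2 penalties.
from interpret.ext.blackbox import MimicExplainer
from interpret.ext.glassbox import LGBMExplainableModel

explainer = MimicExplainer(
    fitted_model,                  # the trained (teacher) model or pipeline
    X_train,                       # initialization examples
    LGBMExplainableModel,          # LightGBM surrogate model
    explainable_model_args={"reg_alpha": 0.0, "reg_lambda": 0.0},  # surrogate L1/L2 penalties
    features=list(X_train.columns),
)
global_explanation = explainer.explain_global(X_test)
print(global_explanation.get_feature_importance_dict())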

imatiach-msft commented 4 years ago

@BillmanH maybe I am misunderstanding the issue, though; I'm not sure why you are expecting the feature importance values to always be non-zero. Could you provide a sample notebook or dataset with this issue? If not, a bit more context on why the feature importance values should be non-zero would help clarify the issue.

BillmanH commented 4 years ago

@imatiach-msft , you may have discovered it. I'll look into this and update if this turns out to be the issue.

BillmanH commented 4 years ago

Thanks @imatiach-msft, I guess I assumed that the mimic explainer was just a naming convention and not a distinct type of explainer. Honestly, I think I assumed that because the model type was declared in AutoML, it would not be necessary to declare it again in the explanation process.

Looking at: https://docs.microsoft.com/en-us/python/api/azureml-explain-model/azureml.explain.model?view=azure-ml-py
I can see that there are several types of explainers. We went through several and found that each AzureML explainer expects a specific type of model. The Model object retrieved from AutoML appears to be the wrong model type.

Running:

explainer = LinearExplainer(fitted_model, X)

Returns Exception: An unknown model type was passed:

I noticed that the mimic explainer has a specific MimicWrapper that is used. Is there a similar linear wrapper that is required as well?

Bottom line: I want an explanation for a regression model. I feel like the older approach was much easier than this. Is there a document that walks through the differences between setting up the linear explainer and the mimic (default) process?

imatiach-msft commented 4 years ago

@BillmanH azureml-explain-model is the old, deprecated package; it was split into interpret-community, which is open source (https://github.com/interpretml/interpret-community), and azureml-interpret. interpret-community is an extension to interpret (https://github.com/interpretml/interpret), which was written by MSR and mainly focuses on EBM, their glassbox model.

If the model object from AutoML contains pre-processing steps, then it can't be used with SHAP's linear explainer, since that is a greybox, or model-specific, explainer (the mimic explainer is a blackbox explainer, similar to SHAP's KernelExplainer). I'm guessing that is the issue you are running into. If you can share the code I can take a look at it, or maybe we can discuss it in a Teams meeting. I would like to understand what fitted_model is, specifically whether it's a pipeline and what it contains.
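
For reference, a minimal sketch of what a SHAP-style linear explainer expects, i.e. a bare linear estimator exposing coef_ and intercept_ that was fit on already-transformed features rather than a full AutoML pipeline (the names below are illustrative):

# Illustrative only: shap.LinearExplainer works on a plain linear model, not a pipeline
# that still contains featurization/pre-processing steps.
import shap
from sklearn.linear_model import Ridge

bare_model = Ridge().fit(X_transformed, y)    # X_transformed: features after pre-processing
explainer = shap.LinearExplainer(bare_model, X_transformed)
shap_values = explainer.shap_values(X_transformed)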

Could you explain what you mean by the older approach? Do you mean the TabularExplainer, which is a composition of multiple SHAP-based methods and finds the best explainer for the given model?
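
For comparison, a minimal sketch of the TabularExplainer path from interpret-community, which only needs the pipeline's predict function and so can take the full fitted model (fitted_model, X_train and X_test are placeholders):

# Sketch: TabularExplainer selects an appropriate SHAP-based explainer for the given model.
from interpret.ext.blackbox import TabularExplainer

explainer = TabularExplainer(fitted_model, X_train, features=list(X_train.columns))
global_explanation = explainer.explain_global(X_test)
print(global_explanation.get_feature_importance_dict())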

BillmanH commented 4 years ago

Thanks @imatiach-msft, this is clearly what the issue is. I think you've shown us that we need to go through a lot more documentation to get to what we want, but you've put us on the right track. I'm sure it will be worth the extra effort.