alteryx / evalml

EvalML is an AutoML library written in Python.
https://evalml.alteryx.com
BSD 3-Clause "New" or "Revised" License

Prediction explanations should aggregate contributions across all levels of categorical features #1347

Closed by freddyaboulton 3 years ago

freddyaboulton commented 4 years ago

If a user calls any of the explain-predictions functions on a pipeline that has an OHE, the resulting table will have a row for each level of a categorical variable. We should aggregate the contributions across all levels of a categorical feature into a single row in the table so that the output is more intuitive.

There are two ways to do this:

  1. Store a mapping from the encoded feature names to the original feature name. Then, I guess we would sum the contributions from all of the encoded features and call that the original feature's contribution.
  2. Pass the entire pipeline into the shap algorithm instead of just the pipeline estimator, like we do for permutation_importance/partial dependence. Unfortunately, the shap library checks against a hardcoded list of estimators, so I don't think this is viable.

I think we're left with 1 - looks like the library author says summing is reasonable for encoded columns, so it's not too bad of a plan (rough sketch below).
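A minimal sketch of option 1, assuming we keep a mapping from encoded column names back to the original feature name (the mapping, column names, and DataFrame layout below are illustrative, not EvalML internals):

import pandas as pd

# Hypothetical mapping from OHE-encoded column names to their source feature;
# numeric passthrough features map to themselves.
encoded_to_original = {
    "sex_male": "sex",
    "sex_female": "sex",
    "sex_decline_to_state": "sex",
    "age": "age",
}

def aggregate_shap_values(shap_values: pd.DataFrame) -> pd.DataFrame:
    # shap_values: one row per prediction, one column per encoded feature.
    # Sum the per-column contributions back to each original feature.
    groups = [encoded_to_original[c] for c in shap_values.columns]
    return shap_values.T.groupby(groups).sum().T

shap_df = pd.DataFrame(
    [[0.03, 0.01, 0.02, 0.10]],
    columns=["sex_male", "sex_female", "sex_decline_to_state", "age"],
)
print(aggregate_shap_values(shap_df))  # age -> 0.10, sex -> 0.06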

freddyaboulton commented 3 years ago

@chukarsten I think this is doable now that #1416 got done

freddyaboulton commented 3 years ago

@chukarsten @dsherry This issue as written only mentions the OHE. I think the plan we have in place for that is pretty clear and reasonable.

But we can also apply the same plan for other components, e.g. the TextFeaturizer and DateTimeFeaturizer. Is aggregating the shap values for the features created by those components in scope for this issue?

I think aggregating the shap values for features created by the TextFeaturizer makes sense because the features are not easily interpretable by humans and the names given to these features are a bit confusing.

I'm on the fence about the DateTimeFeaturizer. I think it's pretty clear what the features created by the DateTimeFeaturizer are, e.g. date_day_of_week, date_month, and users may want to know how the day-of-week impacts a prediction vs. the month. So I'm in favor of not aggregating those features, but curious what you guys think.

chukarsten commented 3 years ago

@freddyaboulton I think your instincts serve you well and these additional components should be done. I would err on the side of keeping the PRs small: limiting the scope of this one to the OHE makes sense, and filing discrete issues for the TextFeaturizer and DateTimeFeaturizer does too, just to keep reviews manageable and to better propagate lessons learned from the OHE PR to the others.

freddyaboulton commented 3 years ago

Discussed the requirements with @rpeck @dsherry @chukarsten @angela97lin and @bchen1116:

We agreed the best long-term solution is to return both the aggregated and non-aggregated shap values for the features that support aggregation.

To close this issue, only the "dict" output format needs to contain the non-aggregated shap values.

I'll try to display the aggregated and non-aggregated values for the text and dataframe output formats but if it turns out to be hard I'll file a follow-up issue!

rpeck commented 3 years ago

Yay on including both the aggregated and non-aggregated SHAP values!

One note, @freddyaboulton and @dsherry:

SHAP values have the nice property that if the things we're interpreting are independent in the statistical sense, the Shapley values can be added together. This is great for things like OHE, where the features are disjoint.

However, for things like DateTime we can have different subgroups of ways of feature engineering the raw column. Each of these might or might not result in a group of mutually exclusive engineered columns, but the groups could overlap. In this case, I'm 97% sure we can't just add everything together. The original source feature would be over-counted. I have to sleep on this a couple of nights to really understand how Shapley values handle conditional probabilities like this...

Think of DateTime: the same raw column can be expanded into several subgroups of engineered columns, e.g. day-of-week, month, and is-holiday, and each subgroup on its own captures (part of) the same underlying date.

However, if we add all of these, the original feature will look 3x as important, or something like that.

Just like interpretability techniques such as Partial Dependence plots, Shapley values rest on an independence assumption: that you can vary one feature while holding the others constant. That obviously isn't the case for these overlapping-subgroup cases.

I think we should hold off on aggregating DateTime and text feature-engineered columns until we understand this better. "We" meaning I intend to give this a lot of subconscious background cycles.

rpeck commented 3 years ago

A sidebar, @freddyaboulton: this means that for things like DateTime I think we will probably want to support having more than one way of splitting a given feature, each of which will sum to the same aggregated Shapley value for the original feature.

freddyaboulton commented 3 years ago

@rpeck thanks for the helpful insight! I get what you mean by overlapping subgroups and I agree it's best that we hold off on aggregating them.

rpeck commented 3 years ago

From a short Slack discussion:

The main points here are:

  1. We shouldn't try to do this right now, but
  2. It would be great if the API output format naturally extended to this case; that shouldn't require any additional implementation work now.

Examples of what I mean:

"the_date": {
    "is holiday" : {
    true: 0.15,
    false: 0.25
    },
    "day of week" : {
    "Sunday": 0.00,
    "Monday": 0.08,
    "Tuesday": 0.08,
    "Wednesday": 0.08,
    "Thursday": 0.08,
    "Friday": 0.08,
    "Saturday": 0.00
    }
}
"sex": {
    "one-hot encoded": {
    "male": 0.03,
    "female": 0.01,
    "decline_to_state": 0.02
    }
}
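Note the nesting: source feature, then a named decomposition, then leaf values. A second decomposition of "the_date" (say, "month") would slot in as a sibling of "day of week" without changing the schema, which is the extensibility point 2 above asks for.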
dsherry commented 3 years ago

Discussed with @freddyaboulton @chukarsten

Problem: can't access the feature value for OHE-aggregated features in a general-purpose way.

Options

  1. Walk through pipeline, and for any features derived from an OHE step, aggregate those features
    • Challenge: the current implementation @freddyaboulton has been working on can't get the feature value from an intermediate step in the pipeline. We don't have a clear way to do that today.
    • One idea: re-evaluate the pipeline up to / through each OHE step. Or, to put it another way, implement a "cache everything" mode for pipeline evaluation, turn that on for prediction explanations, and access those values (see the sketch after this list).
  2. Always show both a high-level rollup (to orig set of features), and lowest-level breakdown (to all generated features)
  3. Punt on this support.

@freddyaboulton likes option 2. But we should discuss further before deciding.
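For what it's worth, a rough sketch of the "cache everything" idea from option 1. The component_graph attribute, .name, and .transform interface here are assumptions for illustration, not necessarily EvalML's real API:

import pandas as pd

def transform_with_cache(pipeline, X: pd.DataFrame) -> dict:
    # Run the pipeline's transformers in order, caching each intermediate
    # output so prediction explanations can read post-OHE feature values
    # without re-deriving them.
    cache = {}
    X_t = X
    for component in pipeline.component_graph[:-1]:  # all but the final estimator
        X_t = component.transform(X_t)
        cache[component.name] = X_t
    return cache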

freddyaboulton commented 3 years ago

We agreed on option 2 while we work towards option 1 in the long term!

Here are some comments from the shap package author stating that it's reasonable to sum values across features:

https://github.com/slundberg/shap/issues/282 https://github.com/slundberg/shap/issues/465 https://github.com/slundberg/shap/issues/933

rpeck commented 3 years ago

Shapley values by definition are summable IF the individual values are for disjoint features.
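In symbols (my notation, not from the thread), writing $\phi(G)$ for the Shapley attribution of a feature group $G$ treated as a single player, the claim is:

$$\phi(A \cup B) = \phi(A) + \phi(B) \quad \text{when } A \cap B = \emptyset \text{ and the features are independent.}$$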

rpeck commented 3 years ago

Since they're linear, you could presumably do this for overlapping groups:

  1. calculate the Shapley value for the intersection of the groups
  2. add the Shapley values for the two groups, and subtract the value for the intersection

(Venn diagram) #yaylinearity
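In the same notation as above, the proposal is inclusion-exclusion over the group attributions:

$$\phi(A \cup B) = \phi(A) + \phi(B) - \phi(A \cap B)$$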

rpeck commented 3 years ago

These two just say that you can sum the contributions of disjoint feature groups. That of course covers OHE, but it doesn't cover the question of overlapping features coming from DateTime expansion, for example:

slundberg/shap#282 slundberg/shap#465

This one has a comment from the author (linked below) about interactions that seems to support what I was saying about overlapping groups:

slundberg/shap#933

https://github.com/slundberg/shap/issues/933#issuecomment-564269636

He says the effect should be quite low for interactions: the "intersection" term for interactions should usually be much smaller than the features themselves. That isn't true for things like day-of-week being grouped with month, though...

Still thinking.

rpeck commented 3 years ago

Ok. So. :-)

I thought and slept on this over the weekend, and I talked over my thoughts with Dan Putlier this morning to see what he thought. The summary of that conversation is this: his intuitions about summing Shapley values for engineered features match mine. [breathes sigh of relief]

tl;dr: We can sum all of the "leaf" generated features up to a valid Shapley value for the source feature. The behavior of overlapping subgroups of generated features is trickier and is something to attack "later".

Details:

  1. For cases like OHE, we all agree there's no real issue: the generated features cover the space and are non-overlapping, so they can be summed up to a valid Shapley value for the source feature.
  2. For cases like DateTime, where we have subgroups like DayOfWeek and Month that overlap, there is an issue within the subgroups. Like the OHE case, we can sum the Shapley values of all of the generated columns (think of these as the leaf nodes of the decomposition) up to the source column.
  3. If the subgroups have the same property as OHE, namely that each subgroup (e.g. DayOfWeek) covers the space and each member (e.g. isMonday, isTuesday, etc.) is independent, we can sum those members up to a subgroup total. In other words, the Shapley value for DayOfWeek should be valid.
  4. However, while the attribution to the subgroups is unbiased, it could have a lot of variance. Just as with collinear features of any other nature, we need to be careful and say that the user shouldn't put too much faith in how much explanatory power is attributable to each overlapping subgroup. In other words, if we have DayOfWeek and Month, we can't necessarily trust their relative values.
  5. Dan said the only way to really understand the variance (to put bounds on the Shapley values for overlapping subgroups) is to use Monte Carlo techniques. I suggested that looking at the Shapley values from different cross-val models would be useful for this, and he agreed (see the sketch after this list). Maybe looking at the variance of the explanations across different trees is another way. This is a research topic. ;-)
  6. One of the issues here is that TreeExplainer uses conditional rather than marginal distributions for the features, just because of the nature of how it works...
  7. My head hurts.
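A sketch of the cross-val idea from point 5. shap.TreeExplainer is the real shap API; the model choice, data layout, and aggregation are assumptions for illustration:

import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def subgroup_attribution_spread(X, y, subgroup_columns):
    # X: pandas DataFrame, y: pandas Series, subgroup_columns: e.g. the
    # day-of-week columns. Refit per CV fold, sum the subgroup's signed
    # SHAP values per row, and report the mean magnitude and its spread
    # across folds -- a cheap stand-in for a Monte Carlo bound on the
    # subgroup's attribution.
    totals = []
    for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = RandomForestRegressor(random_state=0)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        shap_values = shap.TreeExplainer(model).shap_values(X)
        cols = [X.columns.get_loc(c) for c in subgroup_columns]
        totals.append(np.abs(shap_values[:, cols].sum(axis=1)).mean())
    return np.mean(totals), np.std(totals)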