dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Misguided Implementation of Saabas' method in `RegTree::CalculateContributions(Approx)` #8251

Open nalzok opened 2 years ago

nalzok commented 2 years ago

Despite citing the original blog post Interpreting random forests, the current implementation of Saabas' method in RegTree::CalculateContributionsApprox seems misguided.

https://github.com/dmlc/xgboost/blob/7d43e74e71c1a0f6526fda78d46395c883d14cb0/src/tree/tree_model.cc#L1214-L1218

In particular, the mean_values parameter is filled by the function FillNodeMeanValues, which, as its name suggests, simply calculates the average response within a particular node.

https://github.com/dmlc/xgboost/blob/7d43e74e71c1a0f6526fda78d46395c883d14cb0/src/predictor/cpu_predictor.cc#L260-L275

However, I argue that we should remove the parameter mean_values and use (*this)[nid].LeafValue() instead. This is because the idea of Saabas' method is to see how the prediction changes as we go deeper into the tree.

To quote the blog post by Saabas,

What's novel here is that you can see the breakdown of the prediction, written down in terms of value changes along the prediction path, together with feature names that “caused” every value change due to being in the guard (the numbers are approximate due to rounding).
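
To make that concrete, here is a toy sketch of the path decomposition (the dict-based tree layout and field names are made up for illustration; this is not XGBoost code). Every value change along the prediction path is credited to the feature that was split on, and the root value plays the role of the bias:

```python
# Toy illustration of Saabas' method: attribute the change in the per-node
# value at every split on the prediction path to the feature being split on.
def saabas_contributions(tree, x):
    node = tree                 # start at the root
    bias = node["value"]        # root value acts as the "expected" prediction
    contribs = {}               # feature index -> accumulated value change
    while "split_feature" in node:  # descend until we reach a leaf
        feature = node["split_feature"]
        child = node["left"] if x[feature] < node["threshold"] else node["right"]
        contribs[feature] = contribs.get(feature, 0.0) + child["value"] - node["value"]
        node = child
    return bias, contribs       # bias + sum(contribs.values()) == leaf value

# A depth-2 example tree carrying a per-node "value".
tree = {
    "value": 0.5, "split_feature": 0, "threshold": 3.0,
    "left": {"value": 0.2, "split_feature": 1, "threshold": 1.0,
             "left": {"value": 0.1}, "right": {"value": 0.4}},
    "right": {"value": 0.9},
}
print(saabas_contributions(tree, x=[2.0, 0.5]))
# (0.5, {0: -0.3, 1: -0.1}) up to float rounding; 0.5 - 0.3 - 0.1 == leaf value 0.1
```

The question in this issue is what the per-node "value" should be for XGBoost trees.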

For random forests, the prediction/weight coincides with the average value within that node, i.e. $-\frac{G_j}{H_j}$, which is what the current implementation calculates. However, for XGBoost with regularization, the value should be $-\frac{\operatorname{ThresholdL1}(G_j, \alpha)}{H_j+\lambda}$, where $\alpha$ and $\lambda$ are the L1 and L2 regularization parameters respectively.

https://github.com/dmlc/xgboost/blob/7d43e74e71c1a0f6526fda78d46395c883d14cb0/src/tree/param.h#L246-L258
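
In Python-style pseudocode, the regularized weight that the linked code computes is roughly the following (a sketch, not the actual source; additional guards in the real implementation are omitted):

```python
# Sketch of the regularized leaf weight: w = -ThresholdL1(G, alpha) / (H + lambda).
def threshold_l1(g, alpha):
    # soft thresholding used for the L1 penalty
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0

def calc_weight(sum_grad, sum_hess, reg_alpha=0.0, reg_lambda=1.0):
    if sum_hess <= 0.0:
        return 0.0
    return -threshold_l1(sum_grad, reg_alpha) / (sum_hess + reg_lambda)

# With squared error and a base score of 0, sum_grad = -sum(y) and sum_hess = n,
# so the leaf weight is sum(y) / (n + lambda) rather than the plain mean of y.
print(calc_weight(sum_grad=-6.0, sum_hess=2.0, reg_lambda=1.0))  # 2.0, whereas the mean is 3.0
```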

Additionally, in this context we are using Saabas' method as an approximation to (Tree)SHAP. The SHAP value of a feature is the average change in the model output from conditioning on that feature, taken over all orderings in which the features can be introduced one at a time; Saabas' method approximates it by considering only one particular ordering, namely the one specified by the tree. That correspondence breaks down if we use the average response instead of the actual model prediction.
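
For reference, the textbook definition of the Shapley value of feature $i$ for a value function $f_S$ on feature subsets $S \subseteq F$ (nothing XGBoost-specific here) can be written as an average over feature orderings:

$$
\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \left[ f_{S \cup \{i\}}(x) - f_S(x) \right] = \frac{1}{|F|!} \sum_{\pi} \left[ f_{P_i^{\pi} \cup \{i\}}(x) - f_{P_i^{\pi}}(x) \right],
$$

where $\pi$ ranges over all orderings of the features and $P_i^{\pi}$ is the set of features preceding $i$ in $\pi$. Saabas' method keeps only the single ordering induced by the prediction path.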

To sum up, I believe the RegTree::CalculateContributionsApprox method should use the weight/prediction value in each node, instead of mean_values. I'm more than happy to open a PR to address that. Please let me know what you think.

trivialfis commented 2 years ago

@RAMitchell

nalzok commented 2 years ago

I would like to point out that RegTree::CalculateContributions has a similar problem

https://github.com/dmlc/xgboost/blob/7d43e74e71c1a0f6526fda78d46395c883d14cb0/src/tree/tree_model.cc#L1417-L1421

Here node_value is not the expected value of the tree's predictions. Consider a decision stump with a root and two leaves. Denote the set of samples in the left leaf with $L$, and those in the right leaf with $R$. Also assume we are dealing with MSE loss with L2 regularization parameter $\lambda > 0$, and define $N = |L| + |R|$ to be the total number of samples in the tree. The expected value of the tree's predictions would be

$$
\begin{aligned}
\operatorname*{E}(f(X)) &= \frac{1}{N} \sum_{i=1}^N f(X_i) \\
&= \frac{|L|}{N} \cdot \frac{\sum_{X \in L} f(X)}{|L|+\lambda} + \frac{|R|}{N} \cdot \frac{\sum_{X \in R} f(X)}{|R|+\lambda} \\
&< \frac{|L|}{N+\lambda} \cdot \frac{\sum_{X \in L} f(X)}{|L|} + \frac{|R|}{N+\lambda} \cdot \frac{\sum_{X \in R} f(X)}{|R|} \\
&= \frac{\sum_{X \in L \cup R} f(X)}{N+\lambda} \\
&< \frac{\sum_{X \in L \cup R} f(X)}{N}
\end{aligned}
$$

Note that the penultimate line is the prediction at the root node, and the last line is the average response within the root. Clearly the former is a better approximation of $\operatorname*{E}(f(X))$ (it's still not exact, but hopefully close enough for shallow trees). My proposal is to use the following

// approximate the expected value of the tree's predictions
if (condition == 0) {
  bst_float node_value = (*this)[0].LeafValue();
  out_contribs[feat.Size()] += node_value;
}

With this change, we no longer need mean_values and can skip the FillNodeMeanValues function entirely, which would let us compute the feature contributions in less time!

However, note that the SHAP values should add up to the model's prediction for the particular input feat, i.e. the local accuracy property. If we change the "bias" while holding the other feature contributions constant, would this property still hold? In fact, I wonder how it could hold in the first place, given that we got the expected value of the tree's predictions wrong...
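
For what it's worth, local accuracy is easy to check empirically from the Python package; a quick sketch with random data and arbitrary hyperparameters:

```python
import numpy as np
import xgboost as xgb

# Empirical check of local accuracy: per-feature contributions plus the bias
# column should add up to the raw (margin) prediction for every row.
rng = np.random.default_rng(0)
X, y = rng.random((200, 5)), rng.random(200)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3, "reg_lambda": 1.0}, dtrain, num_boost_round=20)

contribs = booster.predict(dtrain, pred_contribs=True)                      # TreeSHAP, shape (200, 6)
approx = booster.predict(dtrain, pred_contribs=True, approx_contribs=True)  # current approximate method
margin = booster.predict(dtrain, output_margin=True)

print(np.allclose(contribs.sum(axis=1), margin, atol=1e-5))  # exact SHAP satisfies local accuracy
print(np.allclose(approx.sum(axis=1), margin, atol=1e-5))    # and the approximation, before/after the change?
```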

RAMitchell commented 2 years ago

Will try to go through this in detail when I get time.

RAMitchell commented 2 years ago

Focusing on the RegTree::CalculateContributions part for now.

Why does the lambda regularisation term appear in this formula? It's not involved in prediction at all.

When you write $\frac{|L|}{N} \frac{\sum_{X \in L} X}{|L|+\lambda}$ do you mean $\frac{|L|}{N} \frac{\sum_{X \in L} f(X)}{|L|+\lambda}$? $X_i$ is a vector.

nalzok commented 2 years ago

@RAMitchell

Why does the lambda regularisation term appear in this formula? It's not involved in prediction at all.

According to this section, the weight/leaf value/prediction for the $j$-th node/leaf is $w_j^* = -\frac{G_j}{H_j+\lambda}$, so $\lambda$ is definitely involved in prediction. Intuitively this makes sense, because L2 regularization shrinks the prediction towards zero.

And yes, you are right. I have fixed the typo now. Thanks for pointing that out!

RAMitchell commented 2 years ago

When you make the prediction $f(X_i)$ the lambda term is already baked into this return value. You don't have to include it.

nalzok commented 2 years ago

Sorry, that was some bad notation on my part. In the case of MSE loss, we have $f(X_i) = -\frac{G_K}{H_K+\lambda} = \frac{\sum_{X_j \in K} Y_j}{|K|+\lambda}$, where $K$ is the leaf node into which $X_i$ falls. What I am trying to say is that we need to do Saabas/TreeSHAP with the prediction value, i.e. $f(X_i)$, not the average response, i.e. $\frac{\sum_{X_j \in K} Y_j}{|K|}$. You are right that $\lambda$ should be baked in, but this is not the case when we calculate mean_values.

$$
\begin{aligned}
\operatorname*{E}(f(X)) &= \frac{1}{N} \sum_{i=1}^N f(X_i) \\
&= \frac{|L|}{N} \cdot \frac{\sum_{X_j \in L} Y_j}{|L|+\lambda} + \frac{|R|}{N} \cdot \frac{\sum_{X_j \in R} Y_j}{|R|+\lambda} \\
&< \frac{|L|}{N+\lambda} \cdot \frac{\sum_{X_j \in L} Y_j}{|L|} + \frac{|R|}{N+\lambda} \cdot \frac{\sum_{X_j \in R} Y_j}{|R|} \\
&= \frac{\sum_{X_j \in L \cup R} Y_j}{N+\lambda} \\
&< \frac{\sum_{X_j \in L \cup R} Y_j}{N}
\end{aligned}
$$
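
To make the chain concrete with some made-up numbers: say the responses in the left leaf are $\{2, 4\}$ and the response in the right leaf is $6$, so $|L| = 2$, $|R| = 1$, $N = 3$, and take $\lambda = 1$. Then

$$
\operatorname*{E}(f(X)) = \frac{2}{3} \cdot \frac{6}{3} + \frac{1}{3} \cdot \frac{6}{2} = \frac{7}{3} \approx 2.33, \qquad \frac{\sum_{X_j \in L \cup R} Y_j}{N+\lambda} = \frac{12}{4} = 3, \qquad \frac{\sum_{X_j \in L \cup R} Y_j}{N} = \frac{12}{3} = 4,
$$

so the prediction at the root (3) is a better approximation of the expected prediction (about 2.33) than the plain average response (4).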

One issue I encountered while implementing the change is that you cannot call LeafValue on inner nodes, as they only store split conditions. Shall we make it a struct instead of a union, and always store the prediction weight of each node as if it were a leaf? That wouldn't take extra space, since we would be reusing some of the padding bytes.

https://github.com/dmlc/xgboost/blob/8d247f0d64bee392857b330b20bfe6086798eb7b/include/xgboost/tree_model.h#L266-L273

By the way, we are writing a paper on a related topic and it will cover this issue. I will post it here when it's published.

nalzok commented 2 years ago

Oh, I can see your point now. It turns out that mean_values[n] calculates the expected prediction for the subtree with node n as its root, instead of the average response within n as I initially thought. I'll take the RegTree::CalculateContributions part back. In that case, we indeed need mean_values[0], but that's the only value we ever look at: we don't need to access mean_values[n] for any node n other than the root.
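
Roughly, the recursion behind mean_values looks like the following (an illustrative dict-based sketch, not the real FillNodeMeanValues code): an internal node's value is the cover-weighted average of its children's values, so each node ends up holding the expected prediction of its subtree.

```python
# Rough sketch of the mean_values recursion: a leaf contributes its leaf weight,
# an internal node the cover-weighted average of its children.
def fill_mean_values(node, mean_values):
    if node["is_leaf"]:
        mean = node["leaf_value"]
    else:
        left, right = node["left"], node["right"]
        mean = (fill_mean_values(left, mean_values) * left["cover"]
                + fill_mean_values(right, mean_values) * right["cover"]) / node["cover"]
    mean_values[node["id"]] = mean
    return mean

# The stump from the earlier example: leaf values 2 and 3, covers 2 and 1.
stump = {
    "id": 0, "is_leaf": False, "cover": 3.0,
    "left": {"id": 1, "is_leaf": True, "cover": 2.0, "leaf_value": 2.0},
    "right": {"id": 2, "is_leaf": True, "cover": 1.0, "leaf_value": 3.0},
}
mean_values = {}
fill_mean_values(stump, mean_values)
print(mean_values[0])  # 7/3, the expected prediction of the whole stump
```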

However, what I said about RegTree::CalculateContributionsApprox still stands. Saabas' method is about the prediction values along the path; it does not care about conditional expectations.

nalzok commented 2 years ago

I have opened a PR to show what I mean. It's really just a minor change so please take a look.

I have also looked into the original PR that implemented the feature contributions. There was some disagreement about how mean_values should be calculated. Someone suggested calculating the prediction value as if the node were a leaf, i.e. base_weights, while a developer argued that we should calculate averages of leaf values instead. We ended up implementing the developer's idea, but I'm bringing base_weights up because it does have some merit.

RAMitchell commented 2 years ago

I think I see what you mean. I agree that base_weights should be used to implement Saabas' method exactly as described in the blog post. But if our approximate contributions method exists as an approximation to our Shapley values implementation, the way it is currently written makes more sense to me.

As a side note, it just occurred to me that we could significantly improve the accuracy of the approximation (if indeed we are trying to approximate Shapley values as implemented by Scott Lundberg) by doing a forward and a backward pass through the tree. If the forward pass through the tree is a single sample of a feature ordering, its reverse ordering corresponds to an antithetic sample with variance-reduction properties: https://arxiv.org/abs/2104.12199
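
To illustrate the idea outside of trees, here is a generic sketch of antithetic permutation sampling for Shapley values (the value function is a stand-in; nothing here is XGBoost-specific):

```python
import random

# Each sampled feature ordering is paired with its reverse (the antithetic
# sample), and the marginal-contribution estimates from both are averaged.
def shapley_antithetic(value_fn, n_features, n_samples=1000, seed=0):
    rng = random.Random(seed)
    phi = [0.0] * n_features
    for _ in range(n_samples):
        perm = list(range(n_features))
        rng.shuffle(perm)
        for ordering in (perm, perm[::-1]):  # forward ordering and its reverse
            prefix = set()
            for i in ordering:
                before = value_fn(frozenset(prefix))
                prefix.add(i)
                after = value_fn(frozenset(prefix))
                phi[i] += (after - before) / (2 * n_samples)
    return phi

# Toy additive game: v(S) = sum of per-feature payoffs, so phi recovers the payoffs.
payoffs = [1.0, 2.0, 3.0]
print(shapley_antithetic(lambda s: sum(payoffs[i] for i in s), n_features=3))
```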

nalzok commented 2 years ago

OK, what about adding a new option saabas_contribs to calculate the feature contributions using Saabas' method? That way we don't break anything. Basically, there would be four possible combinations with pred_contribs=True:

  1. Predict(..., saabas_contribs=False, approx_contribs=False) calculates SHAP with TreeSHAP;
  2. Predict(..., saabas_contribs=False, approx_contribs=True) approximates SHAP by using averages of leaf values;
  3. Predict(..., saabas_contribs=True, approx_contribs=False) implements Saabas' method exactly;
  4. Predict(..., saabas_contribs=True, approx_contribs=True) also implements Saabas' method exactly.

One issue is that cases 3 and 4 give the same result, but I think that's fine because Saabas' method is computationally cheap, and currently we don't warn about invalid combinations like pred_contribs=False, approx_contribs=True anyway. Let me know if you want me to raise an error or a warning to make sure users know what they are doing. The important thing here is to clarify that approx_contribs=True does not implement Saabas' method despite the resemblance, and to provide an option to calculate Saabas contributions instead of approximated SHAP values.

I'm not really familiar with the XGBoost codebase so I would appreciate a high-level overview of what I am supposed to do. Can I just add a line here saying kSaabasContribution = 4 and increase the value of subsequent members by 1? Should we also add kSaabasInteraction?

https://github.com/dmlc/xgboost/blob/8d247f0d64bee392857b330b20bfe6086798eb7b/include/xgboost/learner.h#L33-L41

Regarding your sidenote about antithetic sampling, it sounds cool and I definitely want to work on that, but I will focus on getting Saabas' method right first. Do you want to open a new issue for that?

nalzok commented 2 years ago

Another thing I don't quite understand is this remark

https://github.com/dmlc/xgboost/blob/8d247f0d64bee392857b330b20bfe6086798eb7b/python-package/xgboost/core.py#L2007-L2010

Why is setting approx_contribs=True not recommended?