nalzok opened 2 years ago
@RAMitchell
I would like to point out that `RegTree::CalculateContributions` has a similar problem. Here `node_value` is not the expected value of the tree's predictions. Consider a decision stump with a root and two leaves. Denote the set of samples in the left leaf by $L$ and those in the right leaf by $R$. Also assume we are dealing with MSE loss with L2 regularization parameter $\lambda > 0$, and define $N = |L| + |R|$ to be the total number of samples in the tree. The expected value of the tree's predictions would be
$$
\begin{aligned}
\operatorname*{E}(f(X)) &= \frac{1}{N} \sum_{i=1}^N f(X_i) \\
&= \frac{|L|}{N} \frac{\sum_{X \in L}f(X)}{|L|+\lambda} + \frac{|R|}{N} \frac{\sum_{X \in R}f(X)}{|R|+\lambda} \\
&< \frac{|L|}{N+\lambda} \frac{\sum_{X \in L}f(X)}{|L|} + \frac{|R|}{N+\lambda} \frac{\sum_{X \in R}f(X)}{|R|} \\
&= \frac{\sum_{X \in L \cup R}f(X)}{N+\lambda} \\
&< \frac{\sum_{X \in L \cup R}f(X)}{N}
\end{aligned}
$$
Note that the penultimate line is the prediction at the root node, and the last line is the average response within the root. Clearly the former is a better approximation of $\operatorname*{E}(f(X))$ (it's still not exact, but hopefully close enough for shallow trees). My proposal is to use the following
```cpp
// approximate the expected value of the tree's predictions
if (condition == 0) {
  bst_float node_value = (*this)[0].LeafValue();
  out_contribs[feat.Size()] += node_value;
}
```
With this change, we would no longer need `mean_values` and could thus skip the `FillNodeMeanValues` function entirely, making the feature contribution calculation faster.

However, note that the SHAP values should add up to the model's prediction for the particular input `feat` (the local accuracy property). If we change the "bias" while holding the other feature contributions constant, would this property still hold? In fact, I wonder how it could hold in the first place, given that we got the expected value of the tree's predictions wrong...
Will try to go through this in detail when I get time.
Focusing on the `RegTree::CalculateContributions` part for now.
Why does the lambda regularisation term appear in this formula? It's not involved in prediction at all.
When you write $\frac{|L|}{N} \frac{\sum_{X \in L}X}{|L|+\lambda}$, do you mean $\frac{|L|}{N} \frac{\sum_{X \in L}f(X)}{|L|+\lambda}$? $X_i$ is a vector.
@RAMitchell
Why does the lambda regularisation term appear in this formula? It's not involved in prediction at all.
According to this section, the weight/leaf value/prediction for the $j$-th node/leaf is $w_j^* = -\frac{G_j}{H_j+\lambda}$, so lambda is definitely involved in prediction. Intuitively this makes sense: L2 regularization shrinks the prediction towards zero.
And yes, you are right. I have fixed the typo now. Thanks for pointing that out!
When you make the prediction $f(X_i)$ the lambda term is already baked into this return value. You don't have to include it.
Sorry, there is some bad notation. In the case of MSE loss, we have $f(X_i) = -\frac{G_K}{H_K+\lambda} = \frac{\sum_{X_j \in K}Y_j}{|K|+\lambda}$, where $K$ is the leaf node within which $X_i$ falls. What I am trying to say is that we need to do Saabas/TreeSHAP with the prediction value, i.e. $f(X_i)$, not the average response, i.e. $\frac{\sum_{X_j \in K}Y_j}{|K|}$. You are right that $\lambda$ should be baked in, but this is not the case when we calculate `mean_values`.
$$
\begin{aligned}
\operatorname*{E}(f(X)) &= \frac{1}{N} \sum_{i=1}^N f(X_i) \\
&= \frac{|L|}{N} \frac{\sum_{X_j \in L}Y_j}{|L|+\lambda} + \frac{|R|}{N} \frac{\sum_{X_j \in R}Y_j}{|R|+\lambda} \\
&< \frac{|L|}{N+\lambda} \frac{\sum_{X_j \in L}Y_j}{|L|} + \frac{|R|}{N+\lambda} \frac{\sum_{X_j \in R}Y_j}{|R|} \\
&= \frac{\sum_{X_j \in L \cup R}Y_j}{N+\lambda} \\
&< \frac{\sum_{X_j \in L \cup R}Y_j}{N}
\end{aligned}
$$
One issue I encountered when implementing the change is that you cannot call `LeafValue` on inner nodes, as they only store split conditions. Shall we make the node info a `struct` instead of a `union`, and always store the prediction weight of each node as if it were a leaf? That won't take extra space, since we would be re-using some of the padding bytes.
By the way, we are writing a paper on a related topic and it will cover this issue. I will post it here when it's published.
Oh, I can see your point now. It turns out that `mean_values[n]` calculates the expected prediction for the subtree with node `n` as its root, instead of the average response within `n` as I initially thought. I'll take the `RegTree::CalculateContributions` part back. In that case we do indeed need `mean_values[0]`, but that's the only value we ever look at: we don't need to access `mean_values[n]` for any node `n` other than the root.
However, what I said about `RegTree::CalculateContributionsApprox` still stands. The Saabas method is about prediction values along the path; it does not care about conditional expectations.
I have opened a PR to show what I mean. It's really just a minor change so please take a look.
I have also looked into the original PR which implemented the feature contributions. There was some disagreement about how `mean_values` should be calculated. One person suggested calculating the prediction value as if the node were a leaf, i.e. `base_weights`, while a developer argued that we should calculate averages of leaf values instead. We ended up implementing the developer's idea, but I'm bringing `base_weights` up because it does have some merits.
I think I see what you mean. I agree that `base_weights` should be used to exactly implement Saabas' method according to the blog post. If our approximate contributions method exists as an approximation to our Shapley values implementation, it makes more sense to me the way it is currently written.
As a side note, it just occurred to me that we could strongly improve the accuracy of the approximation (if indeed we are trying to approximate Shapley values as implemented by Scott Lundberg) by doing a forward and backwards pass through the tree. If the forward pass through the tree is a single sample of a feature ordering, its reverse ordering corresponds to an antithetic sample with variance reduction properties: https://arxiv.org/abs/2104.12199
OK, what about adding a new option `saabas_contribs` to calculate the feature contributions using Saabas' method? We are not breaking anything that way. Basically, there are four possible combinations with `pred_contribs=True`:

1. `Predict(..., saabas_contribs=False, approx_contribs=False)` calculates SHAP with TreeSHAP;
2. `Predict(..., saabas_contribs=False, approx_contribs=True)` approximates SHAP by using averages of leaf values;
3. `Predict(..., saabas_contribs=True, approx_contribs=False)` implements Saabas' method exactly;
4. `Predict(..., saabas_contribs=True, approx_contribs=True)` also implements Saabas' method exactly.

One issue is that cases 3 and 4 give the same result, but I think that's fine because Saabas' method is computationally cheap, and we currently don't warn on invalid combinations like `pred_contribs=False, approx_contribs=True` anyway. Let me know if you want me to raise an error or warning to make sure users know what they are doing. The important thing here is to clarify that `approx_contribs=True` does not execute Saabas' method despite the resemblance, and to provide an option to calculate Saabas contributions instead of an approximated SHAP.
I'm not really familiar with the XGBoost codebase, so I would appreciate a high-level overview of what I am supposed to do. Can I just add a line here saying `kSaabasContribution = 4` and increase the values of the subsequent members by 1? Should we also add `kSaabasInteraction`?
Regarding your sidenote about antithetic sampling, it sounds cool and I definitely want to work on that, but I will focus on getting Saabas' method right first. Do you want to open a new issue for that?
Another thing I don't quite understand is this remark. Why is setting `approx_contribs=True` not recommended?
Despite citing the original blog post *Interpreting random forests*, the current implementation of Saabas' method in `RegTree::CalculateContributionsApprox` seems misguided.

https://github.com/dmlc/xgboost/blob/7d43e74e71c1a0f6526fda78d46395c883d14cb0/src/tree/tree_model.cc#L1214-L1218
In particular, the `mean_values` parameter is filled by the function `FillNodeMeanValues`, which, as its name suggests, simply calculates the average response within a particular node.

https://github.com/dmlc/xgboost/blob/7d43e74e71c1a0f6526fda78d46395c883d14cb0/src/predictor/cpu_predictor.cc#L260-L275
However, I argue that we should remove the parameter `mean_values` and use `(*this)[nid].LeafValue()` instead. This is because the idea of Saabas' method, as described in his blog post, is to see how the prediction changes as we go deeper into the tree.
For random forests, the prediction/weight coincides with the average value within that node, i.e. $-\frac{G_j}{H_j}$, which is what the current implementation calculates. However, for XGBoost with regularization, the value should be $-\frac{\operatorname{ThresholdL1}(G_j,\,\alpha)}{H_j+\lambda}$, where $\alpha$ and $\lambda$ are the L1 and L2 regularization parameters respectively.
https://github.com/dmlc/xgboost/blob/7d43e74e71c1a0f6526fda78d46395c883d14cb0/src/tree/param.h#L246-L258
Additionally, we are using Saabas' method as an approximation for (Tree)SHAP in this context. The SHAP value for a feature is the average, over all feature orderings, of the change in model output from conditioning on that feature as features are introduced one at a time; Saabas' method approximates it by considering only one particular ordering, namely the one specified by the tree. This correspondence would not hold if we calculated the average response instead of the actual model prediction.
To sum up, I believe the `RegTree::CalculateContributionsApprox` method should use the weight/prediction value of each node instead of `mean_values`. I'm more than happy to open a PR to address that. Please let me know what you think.