dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[Feature request] Expected output over feature subsets #6107

Open arsarabi opened 4 years ago

arsarabi commented 4 years ago

This is a feature request for computing the expected output of the model when only a subset of features are present.

We have a use case where we would like to observe the expected output of the model when multiple features are missing at evaluation time. In other words, we would like to compute the expected output over a subset of all features. In a decision tree, this can be done by finding all leaves reachable given the present (non-missing) features, and computing the weighted average of the outputs at those leaves. This is similar to the TreeSHAP algorithm, but instead of tracking all possible subsets, we are only interested in particular subsets (e.g., specified through a binary mask of shape [nsamples, nfeatures] provided to the predict function). The slower TreeSHAP algorithm includes pseudocode for computing this expectation (see Algorithm 1 in the paper).
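For concreteness, here is a minimal sketch of that expectation for a single tree, in the spirit of Algorithm 1 (EXPVALUE) from the TreeSHAP paper. The flat-array tree layout (`feature`, `threshold`, `children_left`/`children_right`, `value`, `cover`) is a hypothetical stand-in, not XGBoost's actual internal representation; `-1` marks a leaf, and `cover` is the number of training samples that reached each node:

```python
import numpy as np

def expected_value(tree, x, present):
    """Expected output of `tree` for sample `x`, marginalizing over
    features where present[f] is False, weighted by training cover."""
    def walk(node):
        f = tree["feature"][node]
        if f < 0:  # leaf node
            return tree["value"][node]
        left = tree["children_left"][node]
        right = tree["children_right"][node]
        if present[f]:
            # Feature observed: follow the usual decision path.
            nxt = left if x[f] < tree["threshold"][node] else right
            return walk(nxt)
        # Feature missing: average both branches, weighted by the
        # fraction of training samples (cover) that reached each child.
        w_left = tree["cover"][left] / tree["cover"][node]
        w_right = tree["cover"][right] / tree["cover"][node]
        return w_left * walk(left) + w_right * walk(right)
    return walk(0)

# Tiny hand-built tree: the root splits on feature 0 at 0.5; its left
# child splits on feature 1 at 0.5.  Nodes: 0=root, 1=left internal,
# 2, 3, 4 = leaves.
tree = {
    "feature":        np.array([0, 1, -1, -1, -1]),
    "threshold":      np.array([0.5, 0.5, 0.0, 0.0, 0.0]),
    "children_left":  np.array([1, 3, -1, -1, -1]),
    "children_right": np.array([2, 4, -1, -1, -1]),
    "value":          np.array([0.0, 0.0, 3.0, 1.0, 2.0]),
    "cover":          np.array([100.0, 60.0, 40.0, 30.0, 30.0]),
}

x = np.array([0.2, 0.8])
full = expected_value(tree, x, present=[True, True])    # follows the path: 2.0
masked = expected_value(tree, x, present=[False, True]) # 0.6 * 2.0 + 0.4 * 3.0 = 2.4
```

Applying the mask per sample across all trees in the ensemble and summing the per-tree expectations would give the quantity requested here.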

I was wondering whether this could be implemented in XGBoost? I think it would be a great feature. Note that this is different from how XGBoost handles missing features during training, where it chooses the best path for missing values. Instead, this enables handling missing features at evaluation time.

trivialfis commented 4 years ago

I think it's possible. But XGBoost also has its own way of handling missing values: during training it can learn a default direction for missing values at each split.

arsarabi commented 4 years ago

I understand, but I think the two address different scenarios. The way XGBoost currently handles missing values helps when features are missing both during training and at test time. This feature would allow handling cases where features are missing only at evaluation time. It could also serve as a fast proxy for evaluating the performance of a subset of all features (e.g., for feature selection), instead of training a separate model for each combination of features.