dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0
26.3k stars 8.73k forks

SHAP values (discuss) #3659

Closed hlbkin closed 6 years ago

hlbkin commented 6 years ago

Hi Team!

I am just starting with the SHAP package and have read both papers (the tree one and the general framework).

What I am interested in is predicting our own impact on y through our influence on one of the variables x_i. The end goal is to use this in an optimal control problem where y=f(X) is just one part of the system, and where we can control and influence some of the x_i's.

Are SHAP values good for measuring such impact and for further use in optimal control problems? Basically, I am looking for a non-linear analog of the betas in linear regression. I'm asking because, as far as I understood the paper, SHAP measures the impact of x_i being present in the dataset versus missing, not really x_i = 0 versus x_i != 0 (or some particular value).

Note that I'm NOT interested in the absolute importance of variables or in feature selection. A control variable x_i might have very low importance compared to other variables, but be the №1 priority when predicting dy/dx_i.
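To make the distinction concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `model` is a hypothetical stand-in for a fitted non-linear model, `background` is made-up data, and the value function (fix some features to the instance, average the rest over the background rows) is just one common way to define "missing". The SHAP package may use a different expectation depending on the explainer. The sketch compares exact Shapley attributions with a finite-difference estimate of dy/dx_i.

```python
from itertools import combinations
from math import factorial

def model(x):
    # Hypothetical stand-in for a fitted model y = f(X).
    # x[0] is the control variable: steep slope, tiny range in the data.
    return 10.0 * x[0] + x[1] ** 2

# Made-up background data: x0 barely varies, x1 varies widely.
background = [(0.00, -3.0), (0.01, -1.0), (0.02, 1.0), (0.01, 3.0)]

def value(x, subset):
    """Mean model output with features in `subset` fixed to x and the
    remaining features taken from each background row in turn."""
    total = 0.0
    for row in background:
        mixed = [x[i] if i in subset else row[i] for i in range(len(x))]
        total += model(mixed)
    return total / len(background)

def shapley(x):
    """Exact Shapley values by enumerating all feature subsets."""
    n = len(x)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for k in range(n):
            for s in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                contrib += weight * (value(x, set(s) | {i}) - value(x, set(s)))
        phi.append(contrib)
    return phi

def slope(x, i, eps=1e-4):
    """Central finite-difference estimate of dy/dx_i at x."""
    hi, lo = list(x), list(x)
    hi[i] += eps
    lo[i] -= eps
    return (model(hi) - model(lo)) / (2 * eps)

x = (0.02, 3.0)
phi = shapley(x)
print("Shapley values:", phi)
print("local slopes:  ", [slope(x, i) for i in range(len(x))])
```

In this toy setup the control variable x0 receives a much smaller attribution than x1 (about 0.1 versus 4.0), even though its local slope is larger (about 10 versus 6). The attribution depends on how far x0 sits from its typical background value, whereas the slope answers the "what if I move x_i" question.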

If not, what are good ways to do it? I saw that partial dependence plots in sklearn might be somewhat useful for non-linear models.
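For reference, a partial dependence curve can be computed by hand in a few lines. The sketch below is plain Python with a hypothetical `model` and made-up `data` (both are assumptions, not anything from XGBoost or sklearn). It overwrites one feature with each grid value in every row and averages the predictions, and also shows the per-row curve, which can differ a lot from the average when features interact.

```python
def model(x):
    # Hypothetical fitted model with an interaction between x0 and x1.
    return x[0] * x[1] + 0.5 * x[0] ** 2

# Made-up dataset of (x0, x1) rows.
data = [(0.0, -1.0), (1.0, 0.0), (2.0, 1.0), (3.0, 2.0)]

def row_curve(row, feature, grid):
    """Prediction for a single row as `feature` is swept over `grid`
    (often called an individual conditional expectation curve)."""
    curve = []
    for v in grid:
        modified = list(row)
        modified[feature] = v
        curve.append(model(modified))
    return curve

def partial_dependence(feature, grid):
    """Average of the per-row curves over the whole dataset."""
    curves = [row_curve(row, feature, grid) for row in data]
    return [sum(vals) / len(vals) for vals in zip(*curves)]

grid = [0.0, 1.0, 2.0]
pd_curve = partial_dependence(0, grid)
print("partial dependence:", pd_curve)
print("row (3.0, 2.0):    ", row_curve(data[3], 0, grid))
print("row (0.0, -1.0):   ", row_curve(data[0], 0, grid))
```

The averaged curve hides that the effect of x0 has a different shape for different rows, so for control purposes the per-row curve at the current operating point is probably the more relevant quantity. Current scikit-learn versions offer partial dependence utilities in `sklearn.inspection`.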

Thanks in advance

hcho3 commented 6 years ago

I just found this interesting paper: https://arxiv.org/pdf/1705.10883.pdf. The problem of maximizing the tree output is expressed as a mixed integer program and solved approximately.
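For intuition only, and not the method from the paper: a tree ensemble is piecewise constant, so with a single controllable feature the maximum can be found by brute force, trying one value in each interval between that feature's split thresholds. The sketch below uses a hypothetical nested-tuple tree format and made-up trees; it assumes rows with `x[feature] < threshold` go left.

```python
# Hypothetical tree format: (feature, threshold, left, right) or a float leaf.
trees = [
    (0, 0.5, 1.0, (1, 2.0, 3.0, -1.0)),
    (0, 1.5, (1, 1.0, 0.0, 2.0), 0.5),
]

def predict_tree(node, x):
    # Walk down until a leaf (a float) is reached.
    while not isinstance(node, float):
        feature, threshold, left, right = node
        node = left if x[feature] < threshold else right
    return node

def predict(x):
    # Ensemble output is the sum of the individual tree outputs.
    return sum(predict_tree(tree, x) for tree in trees)

def collect_thresholds(node, feature, out):
    # Gather every split threshold that tests `feature`.
    if isinstance(node, float):
        return out
    f, threshold, left, right = node
    if f == feature:
        out.add(threshold)
    collect_thresholds(left, feature, out)
    collect_thresholds(right, feature, out)
    return out

def best_control(x, feature, lower, upper):
    """Best value of one controllable feature in [lower, upper],
    holding the other features of x fixed."""
    cuts = set()
    for tree in trees:
        collect_thresholds(tree, feature, cuts)
    edges = [lower] + sorted(c for c in cuts if lower < c < upper) + [upper]
    # The output is constant inside each interval, so midpoints suffice.
    candidates = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
    def output_at(v):
        trial = list(x)
        trial[feature] = v
        return predict(trial)
    best = max(candidates, key=output_at)
    return best, output_at(best)

best_v, best_y = best_control(x=(0.0, 1.5), feature=0, lower=0.0, upper=2.0)
print(best_v, best_y)
```

With several controllable features the number of candidate combinations grows multiplicatively, which is presumably why a mixed integer formulation becomes attractive.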

tqchen commented 6 years ago

This is a great discussion that I think could be a great topic at https://discuss.xgboost.ai/

hlbkin commented 6 years ago

@hcho3 thanks a lot, I'll try to read it today and look for other papers. @tqchen do I need to create a separate thread there, or can you move this one as a moderator?

I've also started the same discussion in the original SHAP repository, with a better explanation of our problem: https://github.com/slundberg/shap/issues/249. Someone might be interested in following it. Also, I think this is not an XGBoost-only problem but a more general question about non-linear models. The main point would be to find a mathematically valid way to compute, online, the impact of certain variables that we can control (for example, SHAP values in Java/C++).