dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

[Objective] Roadmap of Objective Functions #749

Closed tqchen closed 6 years ago

tqchen commented 8 years ago

This is an issue created for centralizing all discussions and proposals of new objective functions.

The new refactor comes with a plugin system (https://github.com/dmlc/xgboost/tree/master/plugin), which hopefully makes adding new objectives easier.

bzEq commented 8 years ago

What does "objective function" mean (the link above is not valid any more)? Is it like a delegate in C#?

shafiab commented 7 years ago

@tqchen: @suntzu86 and I are working on survival models at Yelp. We've been using R's gbm package, but we wanted to transition into xgboost due to success with it in other models/projects.

To that end, we wanted to either add, or have someone add, the Cox Proportional Hazards regression objective function. R's gbm uses the Breslow approximation for breaking ties, so we thought that would be a good place to start: https://en.wikipedia.org/wiki/Proportional_hazards_model#Tied_times. Efron's method seems to be the primary alternative (more accurate, but more complicated).
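For concreteness, writing δ_i for the event indicator, t_i for the observed time, and f for the boosted score, the negative log partial likelihood under the Breslow approximation is (standard form, stated here for reference rather than copied from the gbm docs):

$$
-\ell(f) = -\sum_{i:\,\delta_i = 1}\left[ f(x_i) - \log\!\sum_{j:\,t_j \ge t_i}\exp\big(f(x_j)\big) \right]
$$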

Does this sound interesting to you? Would you suggest that we use the plugin system linked above, or create a new objective function type directly in the xgboost source code (like alongside the handlers for linear, logistic regression, etc.)?

walterreade commented 7 years ago

@shafiab I would be very interested to see survival models incorporated into xgboost. Have you seen this? Perhaps it is helpful.

https://scikit-survival.readthedocs.io/en/latest/index.html

suntzu86 commented 7 years ago

@walterreade we took a look at that, but we want a Python 2 compatible version. We're also using Breslow's approximation and added a Hessian computation.

@tqchen: @shafiab and I worked on a plugin to add the CoxPH (with Breslow's approximation) objective function to xgboost for a hackathon at work. Our preliminary branch is here: https://github.com/shafiab/xgboost/tree/coxph_breslow_objective_fcn_plugin

The bulk of the code is here: https://github.com/shafiab/xgboost/tree/coxph_breslow_objective_fcn_plugin/plugin/cox_ph

Cox requires extra data to compute its objective. We hacked this into xgboost for the time being by adding a field to MetaInfo: https://github.com/shafiab/xgboost/blob/coxph_breslow_objective_fcn_plugin/include/xgboost/data.h#L41. This has been discussed previously (https://github.com/dmlc/xgboost/issues/513), but the proposed solution there (storing the extra data as a string attribute) is a bit painful. Plus it's very hard to get CV to work with that solution because we don't have access to the train/test indices in the folds. So our quick & dirty solution was to add data storage.

We imagine you will want a more general solution than having a "censor" field that is only used by one objective function. Ideas we've thought of include 1) allowing "labels" to be a matrix, or 2) storing train/test fold indices so that users can split arbitrary data themselves. Open to suggestions, of course.

Anyway, our branch is still in a hacky state (e.g., extra printfs, whitespace, etc.). So we'll do some clean-up and improvements before making a pull request. But we wanted to give you guys a heads-up and start the discussion.

khotilov commented 7 years ago

@suntzu86 @shafiab You can use the label to pack both time and event status. It would be twice as long as the data, but it should not be a problem to unpack within the objective and metrics. I thought about adding it for quite a while, but never felt inspired enough, since based on my previous experience with boosted Cox proportional hazards in gbm it wasn't really offering any significant predictive or operational advantages (at least for my problems). There are just too many hard-to-please assumptions with plain CoxPH.
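To make that packing concrete, a minimal sketch (the helper name is just for illustration, not an existing API):

```python
import numpy as np

# Pack time and event status into one label vector of length 2*n, then
# unpack it again inside a custom objective/metric.
time  = np.array([5.0, 12.0, 3.5])    # observed (event or censoring) times
event = np.array([1.0, 0.0, 1.0])     # 1 = event observed, 0 = censored
packed_label = np.concatenate([time, event])   # length 2n

def unpack_label(label):
    """Split the packed label back into (time, event)."""
    n = len(label) // 2
    return label[:n], label[n:]
```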

suntzu86 commented 7 years ago

@khotilov we considered that too. CV stopped us though. In particular, that requires reworking how "Slice" works: https://github.com/shafiab/xgboost/blob/e9107e6e6a6aa5eb1c3507f1972b32248c0ad20f/src/c_api/c_api.cc#L377 b/c you cannot just push_back anymore. (You could if you interleaved the data, but that probably breaks other assumptions.) Not the end of the world but it seemed like an easy source of confusion.

We thought "stacking" label vectors like that would be pretty confusing for users (e.g., what if you stack in transposed/interleaved order? aka C vs F ordering in numpy). Esp in C/C++, I could see it getting confusing figuring out what data is where.

We've had good experiences with CoxPH for our applications. We've also found xgboost to be more flexible/convenient than R's gbm, and the prediction quality is better too. We're hoping to put together a few comparisons between Cox+R gbm and Cox+xgboost :)

bnuzyc91 commented 6 years ago

@tqchen @suntzu86 I am currently reviewing your cox_ph objective function. The grad part is the first-order derivative and the hess part is the second-order derivative, but the hess part seems to be wrong. Could you help to clarify that?

For Cox Proportional Hazards, the deviance and gradient functions can be seen in http://www.saedsayad.com/docs/gbm2.pdf. But for the hess part, based on my calculation, it should be

hess_i = \sum_j \delta_j I(t_i \ge t_j) \exp(f(x_i)) / R_j - \sum_j \delta_j [ I(t_i \ge t_j) \exp(f(x_i)) ]^2 / R_j^2,

where R_j = \sum_k I(t_k \ge t_j) \exp(f(x_k)) is the risk-set denominator at time t_j. Could you help to verify that?
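For reference, here is a minimal numpy sketch of that gradient/Hessian under the Breslow approximation, written independently of the plugin code; the function name and the O(n^2) loops are only for checking small examples, not for production use:

```python
import numpy as np

def coxph_breslow_grad_hess(time, event, score):
    """Reference gradient/Hessian of the negative Cox partial log-likelihood
    with the Breslow approximation for ties (O(n^2), for verification only).

    time  : observed time (event or censoring) for each row
    event : 1 if the event was observed, 0 if right-censored
    score : current margin predictions f(x_i) on the log-hazard scale
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    exp_f = np.exp(np.asarray(score, dtype=float))

    n = len(time)
    grad = np.empty(n)
    hess = np.empty(n)
    for i in range(n):
        s1 = 0.0  # sum over event times t_j <= t_i of 1 / R_j
        s2 = 0.0  # sum over event times t_j <= t_i of 1 / R_j^2
        for j in range(n):
            if event[j] == 1.0 and time[j] <= time[i]:
                r_j = exp_f[time >= time[j]].sum()  # risk-set denominator R_j
                s1 += 1.0 / r_j
                s2 += 1.0 / (r_j * r_j)
        grad[i] = -event[i] + exp_f[i] * s1           # first derivative
        hess[i] = exp_f[i] * s1 - exp_f[i] ** 2 * s2  # second derivative
    return grad, hess
```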

shafiab commented 6 years ago

@bnuzyc91 you are right about the bug in hessian. I just pushed an update https://github.com/shafiab/xgboost/blob/coxph_breslow_objective_fcn_plugin/plugin/cox_ph/coxph_obj.cc

shafiab commented 6 years ago

@bnuzyc91 Regarding your question about the censor part, @suntzu86 pointed it out earlier: the way we were attempting to add the censor variable might not work out of the box unless we modify other parts of the repo. I believe a good alternative would be to use the label to pass both time_to_convert and the censoring information together (e.g., a positive value can indicate an uncensored event, whereas a negative value can indicate a censored event, with the magnitude being the time_to_convert in both cases). In that case, we will need to add parsing logic in the objective function to convert the label back into the censor and time_to_convert variables.
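A tiny, hypothetical sketch of that sign encoding from the Python side (the feature matrix here is a random placeholder; the encoding itself is the only point):

```python
import numpy as np
import xgboost as xgb

# Sign encoding as described above: positive label = observed event time,
# negative label = censored time, with |label| = time_to_convert in both cases.
time_to_convert = np.array([5.0, 12.0, 3.5, 8.0])
censored        = np.array([0, 1, 0, 1])          # 1 = right-censored

label = np.where(censored == 1, -time_to_convert, time_to_convert)

X_train = np.random.rand(4, 3)                    # placeholder feature matrix
dtrain = xgb.DMatrix(X_train, label=label)
# The objective would then recover (event, time) from (sign, magnitude)
# before computing grad/hess.
```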

bnuzyc91 commented 6 years ago

@shafiab Using the label to pass both time_to_convert and the censoring information together is a good idea! I noticed that in python-package/xgboost you have modified core.py and training.py so that the censor variable can be accepted from the Python wrapper. Did you do the same thing for the R package? It seems the wrapper code on the R side has not been touched.

Chandanpanda commented 6 years ago

Hi, where can we get the xgboost R package with the survival:cox objective function implemented? Is it in the CRAN version?

ZankoNT commented 6 years ago

Hi, I just pulled xgboost for R from the drat repo, but I can't find the option to use survival:cox as an objective. Is it available for R? Or is it Python-only at the moment? Thanks!