dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Feature Request: Ordinal Classification [wish list] #5243

Open hrbzkm98 opened 4 years ago

hrbzkm98 commented 4 years ago

Similar to #695 (closed). In the original discussion, participants did not seem to appreciate the difference between "learning-to-rank" and "ordinal classification": learning-to-rank produces a relative ordering between items, and its output predictions are not class labels, whereas ordinal classification does produce class labels. Here is a blog post introducing loss functions for ordinal classification.

I would certainly like developers to collaborate on this project of mine to integrate ordinal classification into XGBoost. The tricky part is that the additional parameter theta (a vector of thresholds) has to be optimized before one can calculate the optimal leaf weight w; currently, XGBoost does not support custom loss functions that carry additional trainable parameters.
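To make the limitation concrete, here is a minimal sketch (my own illustration, not library code; the THETA values are made up) of an "all-thresholds" ordinal loss wired in as an XGBoost custom objective, with the threshold vector frozen because the custom-objective API offers no way to optimize it jointly:

```python
import numpy as np
import xgboost as xgb

# Hypothetical fixed thresholds for K = 4 ordered classes. In the full
# method these would be optimized jointly with the booster, which is
# exactly what the custom-objective API cannot express.
THETA = np.array([-1.0, 0.0, 1.0])

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def all_thresholds_obj(predt, dtrain):
    """All-thresholds ordinal loss with theta frozen.

    loss(f, y) = sum_k log(1 + exp(s_k * (theta_k - f))),
    where s_k = +1 for thresholds below the label y and -1 otherwise.
    """
    y = dtrain.get_label().astype(int)  # labels in {0, ..., K-1}
    s = np.where(np.arange(len(THETA))[None, :] < y[:, None], 1.0, -1.0)
    p = _sigmoid(s * (THETA[None, :] - predt[:, None]))
    grad = (-s * p).sum(axis=1)          # d loss / d f
    hess = (p * (1.0 - p)).sum(axis=1)   # d^2 loss / d f^2
    return grad, hess

# Usage: bst = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
#                        num_boost_round=100, obj=all_thresholds_obj)
# Scores map to classes by counting crossed thresholds:
# (bst.predict(dtest)[:, None] > THETA[None, :]).sum(axis=1)
```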

If you are curious, here is a brief introduction to ordinal classification and its loss function that I wrote.

hcho3 commented 4 years ago

@hrbzkm98 How does your approach compare to the cumulative logit model, which is commonly used in statistics for ordinal classification?

hrbzkm98 commented 4 years ago

> Probabilistic models for discrete ordinal response have also been studied in the statistics literature [McCullagh, 1980; Fu and Simpson, 2002]. However, the models suggested are much more complex, and even just evaluating the likelihood of a predictor is not straight-forward.

(From page 2 of Rennie, Jason D. M. and Srebro, Nathan, "Loss functions for preference levels: Regression with discrete ordered labels.")

Note that the cumulative logit model is the probabilistic approach, proposed by McCullagh in 1980. My approach is easier to scale to large datasets, which is very much the point of XGBoost.
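For reference, the cumulative logit (proportional odds) model of McCullagh (1980) writes the cumulative class probabilities as a logistic function of a threshold minus a shared score:

```latex
% Cumulative logit (proportional odds) model, McCullagh (1980).
P(y \le k \mid x) = \sigma\bigl(\theta_k - f(x)\bigr),
\qquad \theta_1 < \theta_2 < \cdots < \theta_{K-1},
```

so that P(y = k | x) = sigma(theta_k - f(x)) - sigma(theta_{k-1} - f(x)). Evaluating the likelihood thus requires the ordered threshold vector theta on top of the score f(x), which is the same extra parameter that complicates a custom-loss implementation.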

hcho3 commented 4 years ago

Interesting. It would certainly be nice to have ordinal classification in XGBoost. We can imagine a situation where the label represents a Likert scale.

My hands are full right now (preparing the 1.0 release). I will come back to this later.

hrbzkm98 commented 4 years ago

Thank you so much for your attention. I plan to write my thesis on this; hopefully I can implement my own minimal working version by May.

hcho3 commented 4 years ago

Aside: there is ongoing work to implement censored regression (survival analysis) in XGBoost: #4763. As part of that, I am adding extra fields to the data matrix to express a ranged label (lower bound, upper bound).
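For readers finding this later: that interface shipped as the AFT survival objective, where each label is expressed as a range [lower, upper] (an exact observation has lower == upper, and right-censored data uses +inf as the upper bound). A rough usage sketch with toy data; parameter values are illustrative only:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
lower = rng.uniform(1.0, 5.0, size=100)   # observed (or censoring) times
upper = np.where(rng.random(100) < 0.3,   # ~30% right-censored
                 np.inf, lower)

dtrain = xgb.DMatrix(X)
dtrain.set_float_info("label_lower_bound", lower)
dtrain.set_float_info("label_upper_bound", upper)

params = {
    "objective": "survival:aft",          # accelerated failure time
    "aft_loss_distribution": "normal",
    "aft_loss_distribution_scale": 1.0,
    "tree_method": "hist",
}
bst = xgb.train(params, dtrain, num_boost_round=50)
```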

hrbzkm98 commented 4 years ago

@hcho3 Are you still interested in mentoring for Google Summer of Code? If so, I would be pretty interested in working on this under your guidance.

hcho3 commented 4 years ago

@hrbzkm98 The XGBoost project doesn't have its own GSoC organization. Last summer, I was helping out with RStat, whose administrator reached out to me first.

baozzhao commented 2 years ago

I am also pretty interested in this and would like to help. Has there been any progress on this front? I see that @trivialfis self-assigned this issue.

Two questions:

1. With the changes to XGBoost over the past year and a half, wouldn't it now be possible to simply pass the cumulative-logit loss as a custom loss?
2. Critically, could monotone constraints be enabled alongside it? Ordinal classification with monotone constraints would make XGBoost a natural solution to a problem that is very hard for random forests. (A sketch of how the two might compose follows below.)
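On (2): monotone constraints are an ordinary booster parameter, so they compose with any custom objective. A hedged sketch on toy data, reusing the hypothetical all-thresholds objective and THETA from earlier in this thread (my illustration, not library API):

```python
import numpy as np
import xgboost as xgb

THETA = np.array([-1.0, 0.0, 1.0])  # hypothetical fixed thresholds, K = 4

def all_thresholds_obj(predt, dtrain):
    # Same hedged all-thresholds objective sketched earlier in the thread.
    y = dtrain.get_label().astype(int)
    s = np.where(np.arange(len(THETA))[None, :] < y[:, None], 1.0, -1.0)
    p = 1.0 / (1.0 + np.exp(-s * (THETA[None, :] - predt[:, None])))
    return (-s * p).sum(axis=1), (p * (1.0 - p)).sum(axis=1)

# Constraints act on the raw score f(x); the predicted class is a monotone
# function of f(x) (count of thresholds crossed), so it is constrained too.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.digitize(X[:, 0] + 0.3 * rng.normal(size=200), THETA)  # labels 0..3

dtrain = xgb.DMatrix(X, label=y)
params = {"max_depth": 3, "eta": 0.1,
          "monotone_constraints": "(1,0,0)"}  # non-decreasing in feature 0
bst = xgb.train(params, dtrain, num_boost_round=100, obj=all_thresholds_obj)

# Map raw scores back to ordinal classes by counting crossed thresholds.
pred_class = (bst.predict(dtrain)[:, None] > THETA[None, :]).sum(axis=1)
```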

mthorrell commented 1 month ago

@baozzhao, if you're still interested, there's a project I'm working on that combines pytorch and xgboost (here). It makes ordinal regression fairly straightforward to fit with xgboost: you define the loss in pytorch, and autodiff magic feeds the gradients and hessians to xgboost. I opened an issue over there specifically for ordinal regression, and it has some code that could be useful to you: https://github.com/mthorrell/gboost_module/issues/8
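(Not the gboost_module API, just a generic sketch of the pattern for anyone skimming: define the loss once in pytorch, and let autograd produce the gradient and diagonal hessian that XGBoost's custom-objective hook expects. The all-thresholds loss and THETA here are placeholders.)

```python
import torch
import xgboost as xgb

THETA = torch.tensor([-1.0, 0.0, 1.0])  # hypothetical fixed thresholds

def ordinal_loss(f, y):
    # All-thresholds ordinal loss, written once in pytorch.
    s = torch.where(torch.arange(len(THETA))[None, :] < y[:, None],
                    torch.tensor(1.0), torch.tensor(-1.0))
    return torch.nn.functional.softplus(s * (THETA[None, :] - f[:, None])).sum()

def torch_obj(predt, dtrain):
    # Custom XGBoost objective whose grad/hess come from autograd.
    y = torch.from_numpy(dtrain.get_label()).long()
    f = torch.tensor(predt, dtype=torch.float32, requires_grad=True)
    loss = ordinal_loss(f, y)
    (grad,) = torch.autograd.grad(loss, f, create_graph=True)
    # The loss is separable across samples, so the hessian is diagonal;
    # differentiating grad.sum() recovers exactly that diagonal.
    (hess,) = torch.autograd.grad(grad.sum(), f)
    return grad.detach().numpy(), hess.detach().numpy()

# Usage: xgb.train(params, dtrain, num_boost_round=100, obj=torch_obj)
```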