dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

New Feature - Adjusted Probability when using scale_pos_weight #863

Closed DeluxeAnalyst closed 8 years ago

DeluxeAnalyst commented 8 years ago

The documentation says that we can use scale_pos_weight to help with unbalanced data, but that it shouldn't be used when you care about predicting the right probability.

There is an equation that can turn a probability predicted with scale_pos_weight back into the right probability. This would let us use scale_pos_weight and still get correct probabilities.

Here is the calculation (app = adjusted posterior probability):

app0 <- (1 - prediction) * (A/E) / (C/F)
app1 <- prediction * (B/E) / (D/F)

adjusted.prediction <- app1 / (app1 + app0)

Legend:
prediction = predicted probability for each record
A = Original # of Non-Rare Records
B = Original # of Rare Records
C = New # of Non-Rare Records
D = New # of Rare Records
E = Original Total # of Records
F = New Total # of Records
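As an editor's note, the proposed adjustment is a small post-processing step; a minimal Python sketch (function name is my own, variable names follow the legend above):

```python
def adjust_probability(prediction, A, B, C, D):
    """Rescale a probability predicted on a resampled training set
    back to the original class balance.

    A, B = original counts of non-rare / rare records
    C, D = resampled counts of non-rare / rare records
    """
    E = A + B  # original total # of records
    F = C + D  # resampled total # of records
    app0 = (1 - prediction) * (A / E) / (C / F)  # adjusted weight, class 0
    app1 = prediction * (B / E) / (D / F)        # adjusted weight, class 1
    return app1 / (app1 + app0)

# Example counts used later in this thread: 10,000 non-rare and 100 rare
# records, undersampled to 100 / 100.
print(round(adjust_probability(0.37, A=10_000, B=100, C=100, D=100), 4))  # 0.0058
```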

Would you be able to implement this in the xgboost package? It would be very helpful, and I wouldn't need to run this calculation myself on the back end.

tqchen commented 8 years ago

You are talking about re-calibration. This can be supported as a post-processing step that is independent of the xgboost package. On the other hand, if you really want to handle imbalanced data and still get probabilities, see https://xgboost.readthedocs.org/en/latest/param_tuning.html#handle-imbalanced-dataset
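For reference, the parameter-tuning page linked above suggests setting scale_pos_weight to the ratio of negative to positive instances when you care about ranking (AUC) rather than calibrated probabilities. A minimal sketch with made-up example counts (the label array here is illustrative, not from the thread):

```python
# Hypothetical training labels: 10,000 negatives and 100 positives.
labels = [0] * 10_000 + [1] * 100

n_neg = labels.count(0)
n_pos = labels.count(1)

# Common heuristic from the xgboost parameter-tuning docs:
params = {
    "objective": "binary:logistic",
    "scale_pos_weight": n_neg / n_pos,
}
print(params["scale_pos_weight"])  # 100.0
```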

DmitryUlyanov commented 7 years ago

Not related to xgboost, but for people who found this page: the formulas above are incorrect. They should be

app0 <- (1 - prediction) * (C/F) / (A/E)
app1 <- prediction * (D/F) / (B/E)

Deom23 commented 7 years ago

Dmitry, you are actually incorrect. Run through an example and you will see the problem; the original formula is correct. If I use A = 10,000; B = 100; C = 100; D = 100; E = 10,100; F = 200, and feed the three predictions 0.37, 0.89, and 0.46 through the equations, I get 0.0058, 0.0748, and 0.0084 respectively. Relative to the original response rate of 0.0099, these numbers look perfect.

If I use your "corrected" formula, it gives 0.983, 0.998, and 0.988, which are obviously not correct.

DmitryUlyanov commented 7 years ago

@Deom23 Your intuition seems to be wrong. In your case the old prior is p(y=1) = 0.01 and the new prior is q(y=1) = 0.5. How do you think the predictions should change if you know that you are now more likely to get an object of class 1? The predicted probabilities should increase, right? In the limit, when q(y=1) = 1, our prediction should be "definitely class 1", i.e. p(y=1|x) = 1 for any x. That looks intuitive to me, while you say the probabilities should decrease, 0.37 -> 0.0058.

And of course these formulas are not derived by intuition; we only check whether the result is intuitive. To get a sense of how they are derived, see e.g. Section 2.3 of "Classifier Adaptation at Prediction Time" by Royer and Lampert, and compare the formulas in the paper with the original ones and mine.
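As an editor's aside: the two formulas in this thread are the same prior-shift correction applied in opposite directions, and the disagreement is really about which class balance the model was trained under. A general sketch (my own framing, not from the paper or either commenter):

```python
def shift_prior(pred, train_pos_prior, target_pos_prior):
    """Re-weight a posterior p(y=1|x) from a model trained under
    train_pos_prior so that it reflects target_pos_prior instead."""
    w1 = pred * target_pos_prior / train_pos_prior
    w0 = (1 - pred) * (1 - target_pos_prior) / (1 - train_pos_prior)
    return w1 / (w1 + w0)

# Deom's direction: trained on a 50/50 undersample, mapped back to the
# original 100 / 10,100 base rate.
print(round(shift_prior(0.37, 0.5, 100 / 10_100), 4))  # 0.0058

# Dmitry's direction: trained at the original base rate, mapped to 50/50.
print(round(shift_prior(0.37, 100 / 10_100, 0.5), 3))  # 0.983
```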

Deom23 commented 7 years ago

Dmitry, I think we are talking about different things. Let's say I am building a logistic probability model to identify buyers versus non-buyers, and my universe has 100 buyers and 10,000 non-buyers. The predictions coming out of this model will average out to 0.0099, because that is the original response rate of the universe.

In this rare-case scenario, let's say I decide to build a second model using the same 100 buyers, but only 100 non-buyers this time instead of 10,000. The predictions coming out of this model will average out to 0.5, because that is the new response rate of this smaller universe.

The equations above take the predictions from Model 2 and put them on the same scale as Model 1. So yes, I fully expect a prediction of 0.37 coming out of the second model to be adjusted to 0.0058, because that is on the same scale as the original 0.0099 response rate.

Your equations do not accomplish this; they don't take predictions made on an over-sampled set and "undo" the over-sampling to put them back on the original scale of the universe.