Closed DeluxeAnalyst closed 8 years ago
You are talking about re-calibration. This can be supported as a post-processing step that is independent of the xgboost package. On the other hand, if you really want to handle imbalanced data and still get probabilities, see https://xgboost.readthedocs.org/en/latest/param_tuning.html#handle-imbalanced-dataset
Not related to xgboost, but for anyone who finds this page: the formulas above are incorrect. They should be:
app0 <- (1 - prediction) * (C/F) / (A/E)
app1 <- prediction * (D/F) / (B/E)
Dmitry, you are actually incorrect. You should run through an example and you will see the problem; the original formula is correct. Using A = 10000, B = 100, C = 100, D = 100, E = 10100 and F = 200, and given these three predictions: .37, .89 and .46, running them through the equations gives .0058, .0748 and .0084 respectively. Relative to the original response rate of .0099, these numbers look perfect.
If I use your "corrected" formula, it gives .983, .998 and .988, which are obviously not correct.
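For anyone who wants to reproduce this check, here is a quick Python sketch (the function names `original` and `proposed` are mine, just for illustration; the thread's snippets use R-style `<-` assignment) evaluating both variants with the counts above:

```python
def original(p, A, B, C, D, E, F):
    # Formula from the opening post: rescales a prediction from the
    # over-sampled (new) universe back to the original universe.
    app0 = (1 - p) * (A / E) / (C / F)
    app1 = p * (B / E) / (D / F)
    return app1 / (app1 + app0)

def proposed(p, A, B, C, D, E, F):
    # Dmitry's variant, with the old/new ratios inverted.
    app0 = (1 - p) * (C / F) / (A / E)
    app1 = p * (D / F) / (B / E)
    return app1 / (app1 + app0)

counts = (10000, 100, 100, 100, 10100, 200)  # A, B, C, D, E, F
for p in (0.37, 0.89, 0.46):
    print(round(original(p, *counts), 4), round(proposed(p, *counts), 4))
# original gives 0.0058, 0.0749, 0.0084 (the .0748 above was truncated);
# the inverted variant gives 0.9833, 0.9988, 0.9884.
```

This confirms the two formulas move the predictions in opposite directions, which is the crux of the disagreement in this thread.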
@Deom23 Your intuition seems to be wrong. In your case the old prior is p(y=1) = 0.01 and the new prior is q(y=1) = 0.5. How do you think the predictions should change if you know that you are now more likely to get an object of class 1? The predicted probabilities should increase, right? In the limit, when q(y=1) = 1, our prediction should be "definitely class 1", i.e. p(y|x) = 1 for any x. That looks intuitive to me, while you say the probabilities should decrease, 0.37 -> 0.0058.
And of course these formulas are not developed by intuition; we only check whether the result is intuitive. To get a sense of how they are derived, see e.g. Section 2.3 of "Classifier Adaptation at Prediction Time" by Royer and Lampert. Compare the formulas in the paper with the original ones and mine.
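The adaptation direction being argued for here can be sketched in a few lines of Python (the `adapt` helper is purely illustrative, not from any library; p1 is the prior the model was trained under, q1 the prior you want to adapt to). Reweighting each class posterior by its prior ratio is a standard Bayes-rule prior shift, and with p1 = B/E and q1 = D/F it matches the inverted formula above:

```python
def adapt(p, p1, q1):
    """Adapt a prediction made under class-1 prior p1 to a new prior q1
    by reweighting the class posteriors by the prior ratios (Bayes rule)."""
    app1 = p * q1 / p1                    # reweight mass toward class 1
    app0 = (1 - p) * (1 - q1) / (1 - p1)  # reweight mass toward class 0
    return app1 / (app1 + app0)

# Moving from a rare prior (0.01) to a balanced prior (0.5)
# pushes predictions up, as argued above:
print(adapt(0.37, 0.01, 0.5))   # well above 0.37
# In the limit q1 -> 1 every prediction goes to 1:
print(adapt(0.37, 0.01, 1.0))   # 1.0
```

Note that this adapts a model toward a new prior; "undoing" over-sampling is the same operation run in the opposite direction, which is where the two sides of this thread diverge.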
Dmitry, I think we are talking about different things. Let's say I am building a logistic probability model, trying to identify buyers versus non-buyers, and my universe has 100 buyers and 10,000 non-buyers. The predictions coming out of this model will average out to .0099, because that is the original response rate of the universe.
In this rare-event scenario, let's say I decide to build a second model using the same 100 buyers, but only 100 non-buyers this time instead of 10,000. The predictions coming out of this model will average out to .5, because that is the new response rate of this smaller universe.
The equations above take the predictions from Model 2 and put them onto the same scale as Model 1. So yes, I fully expect a prediction of .37 coming out of the second model to be changed to .0058, because that is on the same scale as the original .0099 response rate.
Your equations do not accomplish this: they don't take predictions made from an over-sampled model and "undo" the over-sampling to put them back on the original scale of the universe.
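One way to see this point: apply the opening post's correction to the balanced model's average prediction of .5 and check that it lands back on the original .0099 response rate. A plain-Python sketch with the counts from this example (variable names follow the legend in the opening post):

```python
A, B = 10000, 100   # original non-buyers / buyers
C, D = 100, 100     # down-sampled non-buyers / buyers
E, F = A + B, C + D  # original / new totals

p = 0.5  # average prediction of the model trained on the balanced sample

# Correction from the opening post:
app0 = (1 - p) * (A / E) / (C / F)
app1 = p * (B / E) / (D / F)
adjusted = app1 / (app1 + app0)

print(round(adjusted, 4))  # 0.0099, the original response rate
```

The balanced model's average prediction maps back exactly to the original universe's base rate, which is what "undoing the over-sampling" means here.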
The documentation says that we can use scale_pos_weight to help with unbalanced data, but that it shouldn't be used when you care about predicting the right probability.
There is an equation that can be used to turn the predicted probability obtained with scale_pos_weight into the right probability. This would allow us to use scale_pos_weight and still get the right probability.
Here is the calculation: (app = adjusted posterior probability)
app0 <- (1 - prediction) * (A/E) / (C/F)
app1 <- prediction * (B/E) / (D/F)
adjusted.prediction <- app1 / (app1 + app0)
Legend:
prediction = predicted probability for each record
A = Original # of Non-Rare Records
B = Original # of Rare Records
C = New # of Non-Rare Records
D = New # of Rare Records
E = Original Total # of Records
F = New Total # of Records
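For reference, the calculation above can be sketched in Python (the snippets here use R-style `<-` assignment; `adjust` and its argument names just mirror the legend and are not part of xgboost or any library):

```python
def adjust(prediction, A, B, C, D, E, F):
    """Rescale a predicted probability from the over-sampled (new) universe
    back to the original universe, following the legend above."""
    app0 = (1 - prediction) * (A / E) / (C / F)  # adjusted posterior, non-rare class
    app1 = prediction * (B / E) / (D / F)        # adjusted posterior, rare class
    return app1 / (app1 + app0)

# Numbers from the worked example in this thread:
# A=10000, B=100, C=100, D=100, E=10100, F=200
for p in (0.37, 0.89, 0.46):
    print(round(adjust(p, 10000, 100, 100, 100, 10100, 200), 4))
# prints 0.0058, 0.0749, 0.0084
```

Running the correction on the back end like this is straightforward, which is presumably why it has been treated as a post-processing step rather than built into the package.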
Would you be able to implement this in the xgboost package? It would be very helpful, and I wouldn't need to run this calculation myself on the back end.