gbm-developers / gbm3

Gradient boosted models

gbm_roc_area #156

Closed aaronhope1981 closed 5 years ago

aaronhope1981 commented 5 years ago

I'm trying to verify the accuracy of the model using the area under curve to get something akin to an r^2 value with linear models. I have 21 variables feeding in, and a Y/N output. The variables on the way in are both numerical (discrete and continuous) and categorical (some Y/N, some broader).

I consistently get an error no matter how I try to poke it with a stick. The vignette states that if the output is binary, then use the gbm.conc() command. Unfortunately, it is not recognized by R.

There is also no clear way to implement either gbm_roc_area() or gbm.conc(). The documentation states that it can be executed directly, but since R doesn't even recognize gbm.conc(), I haven't been able to try to get that to work at all.

Is this a broken feature? Does anyone have some input on it working? Does anyone have an example for the "obs" "pred" values that go in? Do I load a gbm object as the observed, then use out of bag for the pred? Do I need to change my output to zero and one? Should I simply pick the value off of the gbm.perf()?

bradleyboehmke commented 5 years ago

Hey @aaronhope1981 , could you provide a small reproducible example so we can replicate and pinpoint the problem? Thanks.

bgreenwell commented 5 years ago

@aaronhope1981 It seems you may be mixing up the functions from two different packages: gbm and gbm3. Which package are you trying to use? gbm.conc() and gbm.perf() are exported functions from gbm, while gbm_roc_area() is an exported function from gbm3 (you would instead use gbm.roc.area() with gbm). With your reproducible example, could you also specify which package?
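
A quick way to see which of these names an installed package actually exports is sketched below. The helper is_exported() is a hypothetical name made up for this example (it is not part of gbm or gbm3); it relies only on base R's requireNamespace() and getNamespaceExports(), and returns NA when the package is not installed. What it reports depends on the package versions on your machine.

```r
# Hypothetical helper (not part of gbm or gbm3): does `pkg` export `fun_name`?
# Returns NA when the package is not installed.
is_exported <- function(fun_name, pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NA)
  }
  fun_name %in% getNamespaceExports(pkg)
}

# Names mentioned in this thread
fun_names <- c("gbm.conc", "gbm.perf", "gbm_roc_area", "gbm.roc.area")

# One column per package, one row per function name
sapply(c(gbm = "gbm", gbm3 = "gbm3"), function(pkg) {
  sapply(fun_names, is_exported, pkg = pkg)
})
```

A FALSE for the package you have loaded would explain a "could not find function" error.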

aaronhope1981 commented 5 years ago

Using gbm3 (most current posted here). Trying to use the commands out of the vignette, which include gbm_roc_area (with underscore) and gbm.conc (without underscore, which threw me off).

gbm.conc()
Error in gbm.conc() : could not find function "gbm.conc"

That error is understandable if it is not even supposed to be in the package, but the vignette has it spelled out exactly that way. In that light, I don't even have a malfunctioning version of that function as it is unknown to gbm3.

I would love to give a reproducible example, but I'm afraid that I'm an amateur at this, and frankly am not sure what to even feed into the gbm_roc_area. Further, I'm not even 100% sure that this is what I need to be doing given the goal in mind. The instructions state pred and obs, but sadly I'm not even sure which object I'm supposed to feed into the function. There is no current example in the vignette.

The main goal of what I'm trying to do is find an r^2 equivalent for this function to assess how much of the variance in the response variable is explained by the predictors. If I'm trying to use the wrong function, I can accept that. Truthfully, both my PhD instructor and I are stumped on this one.

As for gbm.perf, it works perfectly in gbm3 and is listed in the vignette. All three of the examples (cv, test, and OOB) ran without problems, though I'm not sure what information they display. If they are in fact displaying the effective r^2 value, then I would like to apologize for wasting your time by barking up the wrong tree.

bgreenwell commented 5 years ago

Unfortunately, there is no R-squared type metric for binary classification outside of logistic regression and its close relatives (at least not to my knowledge). The area under the ROC curve (AUC) does provide a statistic that is measured on a similar scale (i.e., 0 to 1), but I would not use gbm_roc_area to obtain it (it's more of an internal function that serves other purposes). Perhaps consider using gbm with the pROC package (overly simple example below):

# Load required packages
library(gbm)

# Load some sample data
data(Sonar, package = "mlbench")
head(Sonar)

Sonar$Class <- ifelse(Sonar$Class == "R", 1, 0)  # recode target as 0/1

# Train a simple GBM using control params
set.seed(101)  # for reproducibility
fit <- gbm(
  formula = Class ~ ., 
  data = Sonar, 
  distribution = "bernoulli",
  n.trees = 1000,
  interaction.depth = 5,
  shrinkage = 0.1,
  cv.folds = 5  # full argument name; "cv" only works through partial matching
)

# Plot Bernoulli deviance (loss function being optimized)
gbm.perf(fit, method = "cv")
best_iter <- gbm.perf(fit, plot.it = FALSE, method = "cv")

# Get predicted probabilities
pred <- predict(fit, n.trees = best_iter, type = "response")

# Plot ROC and compute AUC
roc <- pROC::roc(
  response = as.factor(Sonar$Class),
  predictor = pred
)
plot(roc)
roc
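
On the question of what goes into "obs" and "pred": for any AUC calculation, obs is the vector of observed 0/1 outcomes and pred is the vector of predicted scores or probabilities for the same rows. As a rough sketch of what the number means, the AUC can be computed by hand with the rank-based (Mann-Whitney) formula in base R. The helper auc_rank() is a hypothetical name for this illustration, not a function from gbm, gbm3, or pROC.

```r
# Hypothetical helper: rank-based (Mann-Whitney) AUC.
# obs  = observed outcomes coded 0/1
# pred = predicted scores or probabilities, one per observation
auc_rank <- function(obs, pred) {
  stopifnot(length(obs) == length(pred), all(obs %in% c(0, 1)))
  n_pos <- sum(obs == 1)
  n_neg <- sum(obs == 0)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(pred)  # average ranks, so ties count as half
  (sum(r[obs == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny made-up example: four observations, two of each class
obs  <- c(0, 0, 1, 1)
pred <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(obs, pred)  # 0.75
```

The result is the proportion of (case, control) pairs in which the case receives the higher score, so 0.5 is no better than chance and 1 is perfect separation. It should agree with the pROC value whenever cases tend to score higher than controls; pROC chooses the comparison direction automatically by default, so the two can differ when this hand-computed value falls below 0.5.
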

aaronhope1981 commented 5 years ago

boostedmonstermodel01 <- gbm(
  formula = grad_binary ~
    creds_ovl_cum + creds_ewu_cum +
    gpa_ewu_cum + gpa_ewu_lastq + gpa_ewu_trend + gpa_hs_cum +
    sat_math + sat_lit + first_gen + fin_aid + pell + veteran +
    dss + stem + athlete + intl + gender + age + race +
    gpa_xfer_to_ewu + gpa_hs_to_ewu,
  data = cleantibble,
  train.fraction = 0.75,
  mFeatures = 3,
  cv.folds = 3,
  verbose = F,
  distribution = "bernoulli",
  n.trees = 3000,
  interaction.depth = 5,
  shrinkage = 0.01
)

Warning messages:
1: In gbm_call(gbm_object_list$data, gbm_object_list$dist, gbm_object_list$params, :
  Some terminal node predictions were excessively large for Bernoulli and have been capped. Likely due to a feature that separates the 0/1 outcomes. Consider reducing shrinkage parameter.
2: In gbm_call(gbm_object_list$data, gbm_object_list$dist, gbm_object_list$params, :
  Some terminal node predictions were excessively large for Bernoulli and have been capped. Likely due to a feature that separates the 0/1 outcomes. Consider reducing shrinkage parameter.

summary(boostedmonstermodel01)

                var              rel_inf
creds_ovl_cum   creds_ovl_cum    32.75012265
creds_ewu_cum   creds_ewu_cum    16.48783556
gpa_ewu_lastq   gpa_ewu_lastq    12.74883176
gpa_ewu_cum     gpa_ewu_cum      11.09057568
gpa_ewu_trend   gpa_ewu_trend     7.07947633
stem            stem              6.50356311
gpa_hs_to_ewu   gpa_hs_to_ewu     4.10769491
gpa_xfer_to_ewu gpa_xfer_to_ewu   2.93380403
gpa_hs_cum      gpa_hs_cum        1.84538995
age             age               1.73337643
sat_math        sat_math          0.75373102
sat_lit         sat_lit           0.63427602
race            race              0.56133266
first_gen       first_gen         0.19988613
pell            pell              0.16010367
dss             dss               0.10337631
fin_aid         fin_aid           0.09183221
veteran         veteran           0.07693396
gender          gender            0.05785065
intl            intl              0.05358034
athlete         athlete           0.02642662

# Trying to find area under curve from github
best_iter01 <- gbm.perf(boostedmonstermodel01, plot.it = FALSE, method = "cv")

# Getting prediction probabilities
pred01 <- predict(boostedmonstermodel01, newdata = cleantibble, n.trees = best_iter01, type = "response")

# Plot roc and calculate auc
roc01 <- pROC::roc(
  response = as.factor(cleantibble$grad_binary),
  predictor = pred01
)
plot(roc01)
roc01

Call:
roc.default(response = as.factor(cleantibble$grad_binary), predictor = pred01)

Data: pred01 in 15049 controls (as.factor(cleantibble$grad_binary) 0) < 18950 cases (as.factor(cleantibble$grad_binary) 1).
Area under the curve: 0.9145

It took a bit of wrangling, but overall it pushed out exactly what I needed. Thank you for helping me prove that this model is a solid fit for the response (graduation rate in this case).

I owe you a beer if you're ever in Spokane.

bgreenwell commented 5 years ago

Glad it worked...cheers! 🍻