Closed kransom14 closed 4 years ago
Hi @kransom14, no, gbm() will not automatically encode categorical variables for you. If you need to one-hot encode, it's easy enough to use the caret::dummyVars() function with gbm. The cattonum package may also be of interest here.
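A minimal sketch of that one-hot route, assuming the caret package is installed. The object names (cars, predictors, dv, X, encoded) are made up for illustration; only dummyVars() and its predict() method come from caret.

```r
library(caret)

cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# Encode the predictors only, so the outcome never enters the encoder.
predictors <- cars[, names(cars) != "mpg"]

# dummyVars() learns the factor levels; predict() returns a numeric matrix
# with one indicator column per level of each factor.
dv <- dummyVars(~ ., data = predictors)
X <- predict(dv, newdata = predictors)

# Re-attach the outcome so the result can be passed to gbm(mpg ~ ., data = encoded).
encoded <- data.frame(mpg = cars$mpg, X)
colnames(X)
```

The same dv object can be reused with predict() on held-out data so that training and test sets get the same columns.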
gbm() does not formally create all the indicator variables. For categorical variables with many levels, doing so ends up being expensive in terms of memory and computation time; it is faster to keep each one stored as a single integer-type variable. gbm then uses the CART algorithm's method for selecting how to separate the levels of the categorical variable optimally into two groups.
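For a squared-error regression tree, the classic CART shortcut is to order the levels by their mean response and then consider only splits along that ordering, so a factor with L levels needs L - 1 candidate splits rather than every possible two-group partition. The base R sketch below illustrates that idea on mtcars. Note that best_categorical_split is a hypothetical helper written for this thread, not gbm's internal code, and it ignores weights, missing values and minimum node sizes.

```r
# Illustration only: best two-group split of a factor under squared error.
best_categorical_split <- function(x, y) {
  x <- droplevels(as.factor(x))

  # Order the levels by the mean of the response within each level.
  level_means <- tapply(y, x, mean)
  ordered_levels <- names(sort(level_means))

  best <- list(sse = Inf, left = NULL, right = NULL)

  # Try each cut point along the ordering: the first k levels go left.
  for (k in seq_len(length(ordered_levels) - 1)) {
    left <- ordered_levels[seq_len(k)]
    in_left <- x %in% left
    sse <- sum((y[in_left] - mean(y[in_left]))^2) +
      sum((y[!in_left] - mean(y[!in_left]))^2)
    if (sse < best$sse) {
      best <- list(sse = sse, left = left,
                   right = setdiff(ordered_levels, left))
    }
  }
  best
}

res <- best_categorical_split(mtcars$cyl, mtcars$mpg)
res$left   # levels sent to one child
res$right  # levels sent to the other child
```

The factor itself stays a single column throughout; only the grouping of its levels changes from split to split.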
Greg
From: kransom14 notifications@github.com
Sent: Monday, July 8, 2019 6:42 PM
To: gbm-developers/gbm gbm@noreply.github.com
Cc: Subscribed subscribed@noreply.github.com
Subject: [gbm-developers/gbm] Does gbm() create dummy variables from factors automatically with formula interface? (#44)
I recently discovered that the train() function in caret will automatically create dummy variables for categorical features if the formula interface is used (topepo/caret#1051 https://github.com/topepo/caret/issues/1051 ).
I use train() to tune my models and then run them outside of caret with the gbm package. Does the gbm() function also create dummy variables automatically when the formula interface is used? I assume no because the variable names in the summary are the original variables.
library(gbm)
mtcars$cyl <- as.factor(mtcars$cyl)
mod <- gbm(mpg ~ ., data = mtcars, bag.fraction = 1)
summary(mod)
Distribution not specified, assuming gaussian ...
      var    rel.inf
cyl   cyl 40.5674320
hp     hp 18.0381619
wt     wt 15.1975678
disp disp  9.9238322
carb carb  6.0346971
drat drat  5.7064964
am     am  1.7297626
vs     vs  1.1550008
gear gear  0.8605528
qsec qsec  0.7864963
If dummy variables are not created: how does the function create the design matrix from the formula? Is it done with model.frame as per the gbm() help file in reference to gbm.fit()?
Great, thank you.
My categorical variable has about 100 classes. I prefer to leave it as a factor because the importance becomes diluted if it is one-hot-encoded. During cross validation tuning with so many classes, some classes are not present in the resampled training data but are present in the cross validation testing data. In this case of new levels in the testing data, does gbm() treat the new classes as missing by sending them to the missing node? What if there are no missing cases in the training data for that variable, does it still create a missing node?
Sorry for the extreme delay. gbm should always create a missing node (you can see this for an individual tree using gbm::pretty.gbm.tree()), and new levels should be treated as missing.
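A short sketch of how to inspect that missing node, assuming the gbm package is installed and reusing the mtcars example from the original question. The names cars, mod and tree1 are illustrative; it only prints the tree structure and does not demonstrate prediction on unseen levels.

```r
library(gbm)

cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

set.seed(1)
mod <- gbm(mpg ~ ., data = cars, distribution = "gaussian", bag.fraction = 1)

# One row per node of the first tree. The MissingNode column gives the index
# of the node that receives observations whose split variable is missing.
tree1 <- pretty.gbm.tree(mod, i.tree = 1)
tree1[, c("SplitVar", "LeftNode", "RightNode", "MissingNode", "Prediction")]
```

SplitVar is a zero-based index into the predictors, and terminal nodes are marked with -1 in that column.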