Does gbm() create dummy variables from factors automatically with formula interface?

kransom14 commented 5 years ago

I recently discovered that the train() function in caret will automatically create dummy variables for categorical features if the formula interface is used (https://github.com/topepo/caret/issues/1051).

I use train() to tune my models and then run them outside of caret with the gbm package. Does the gbm() function also create dummy variables automatically when the formula interface is used? I assume no because the variable names in the summary are the original variables.

library(gbm)
mtcars$cyl <- as.factor(mtcars$cyl)  # treat cylinder count as categorical
mod <- gbm(mpg ~ ., data = mtcars, bag.fraction = 1)
summary(mod)  # relative influence by variable

Distribution not specified, assuming gaussian ...

      var    rel.inf
cyl   cyl 40.5674320
hp     hp 18.0381619
wt     wt 15.1975678
disp disp  9.9238322
carb carb  6.0346971
drat drat  5.7064964
am     am  1.7297626
vs     vs  1.1550008
gear gear  0.8605528
qsec qsec  0.7864963

If dummy variables are not created: how does the function create the design matrix from the formula? Is it done with model.frame as per the gbm() help file in reference to gbm.fit()?
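To make the model.frame question concrete, here is a minimal base-R sketch contrasting model.frame(), which keeps a factor as one column, with model.matrix(), which expands it into indicator columns. It only illustrates the two base functions; it is not a claim about gbm's internals, and the name dat is just a local copy introduced for the example.

```r
# Base R only: compare how the two functions treat a factor predictor.
dat <- mtcars
dat$cyl <- as.factor(dat$cyl)

# model.frame() keeps cyl as a single factor column
mf <- model.frame(mpg ~ ., data = dat)
is.factor(mf$cyl)
ncol(mf)

# model.matrix() expands cyl into treatment-coded indicator columns
mm <- model.matrix(mpg ~ ., data = dat)
grep("^cyl", colnames(mm), value = TRUE)
```

If a fitting function works from the model frame rather than the model matrix, the variable names it reports will be the original column names, which is consistent with the summary output above.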

bgreenwell commented 5 years ago

Hi @kransom14, no, gbm() will not automatically encode categorical variables for you. It's easy enough to use the caret::dummyVars() function with gbm, though, if you need to one-hot encode. The cattonum package may also be of interest here.
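A minimal sketch of the dummyVars() route, assuming the caret and gbm packages are installed. The names dat, enc, x, and mod_ohe are made up for illustration, and the gbm settings simply mirror the example in the question.

```r
library(caret)
library(gbm)

dat <- mtcars
dat$cyl <- as.factor(dat$cyl)

# Build the encoder from the predictors; the response is dropped by the formula
enc <- dummyVars(mpg ~ ., data = dat)
x <- as.data.frame(predict(enc, newdata = dat))

# cyl is now spread over one indicator column per level
grep("^cyl", names(x), value = TRUE)

# Re-attach the response and fit on the encoded predictors
x$mpg <- dat$mpg
mod_ohe <- gbm(mpg ~ ., data = x, distribution = "gaussian", bag.fraction = 1)
summary(mod_ohe, plotit = FALSE)
```

With this encoding the relative influence is reported per indicator column rather than for cyl as a whole.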

gregridgeway commented 5 years ago

gbm() does not formally create all the indicator variables. For categorical variables with many levels that ends up being expensive in terms of memory and computation time. It is faster to just keep them stored as a single integer-type variable. Then gbm uses the CART algorithm’s method for selecting how to separate the categorical variable optimally into two groups.

Greg
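One way to check this on a fitted model is to look at the bookkeeping fields stored on the gbm object. This is a sketch that assumes the object exposes var.names, var.type, and var.levels as documented for gbm.object; dat, mod, and idx are names introduced only for the example.

```r
library(gbm)

dat <- mtcars
dat$cyl <- as.factor(dat$cyl)

mod <- gbm(mpg ~ ., data = dat, distribution = "gaussian", bag.fraction = 1)

# Locate cyl among the predictors the model stored
idx <- match("cyl", mod$var.names)

# For a categorical predictor, var.type holds the number of levels
# (0 means the predictor is treated as continuous or ordered)
mod$var.type[idx]

# The levels are kept as a lookup table; no indicator columns are created
mod$var.levels[[idx]]
```

The predictor appears once in var.names, with its levels recorded separately, which matches the single cyl row in the relative influence table.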


kransom14 commented 5 years ago

Great, thank you.

My categorical variable has about 100 classes. I prefer to leave it as a factor because the importance becomes diluted if it is one-hot encoded. During cross-validation tuning with so many classes, some classes are not present in the resampled training data but are present in the cross-validation testing data. When new levels appear in the testing data, does gbm() treat the new classes as missing by sending them to the missing node? And if there are no missing cases in the training data for that variable, does it still create a missing node?

bgreenwell commented 4 years ago

Sorry for the extreme delay. gbm should always create a missing node (you can see this for an individual tree using gbm::pretty.gbm.tree()), and new levels should be treated as missing.
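A short sketch of inspecting one tree with pretty.gbm.tree(), assuming the gbm package is installed. The training data here has no missing values, so it can be used to check whether a missing node is still recorded; dat, mod, and tree1 are illustrative names.

```r
library(gbm)

dat <- mtcars
dat$cyl <- as.factor(dat$cyl)
anyNA(dat)  # mtcars contains no missing values

mod <- gbm(mpg ~ ., data = dat, distribution = "gaussian", bag.fraction = 1)

# One row per node of the first tree; LeftNode, RightNode, and MissingNode
# give the row index of each child (-1 marks a terminal node)
tree1 <- pretty.gbm.tree(mod, i.tree = 1)
tree1[, c("SplitVar", "LeftNode", "RightNode", "MissingNode", "Prediction")]
```

A non-negative MissingNode entry on a split row points at the node that receives observations whose value for the split variable is missing.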

gbm-developers / gbm

Does gbm() create dummy variables from factors automatically with formula interface? #44