CCS-Lab / easyml

A toolkit for easily building and evaluating machine learning models.
https://ccs-lab.github.io/easyml

Question about handling categorical variables #105

Closed irisshen926 closed 5 years ago

irisshen926 commented 5 years ago

My colleague and I have been using the easyml package to analyze our data. We were wondering how the algorithm handles categorical variables (for example, race). We have coded Race to have 6 different values, but they are not on a continuous scale. We looked into the source code to try to figure out how categorical variables are handled. For glmnet, the default preprocess method is preprocess_scale(), which only scales the numerical variables and leaves the categorical variables unchanged.

set_preprocess <- function(preprocess = NULL, algorithm) {
  if (is.null(preprocess)) {
    if (algorithm == "glmnet") {
      preprocess <- preprocess_scale
    } else if (algorithm == "random_forest") {
      preprocess <- preprocess_identity
    } else if (algorithm == "support_vector_machine") {
      preprocess <- preprocess_scale
    }
  }

  preprocess
}

In Preprocess.R:

if (is.null(mask)) {
  # No categorical variables
  X_output <- data.frame(scale(X))
  output <- list(X = X_output)
} else {
  # Categorical variables
  X_categorical <- X[, mask, drop = FALSE]
  X_numerical <- X[, !mask, drop = FALSE]
  X_standardized <- data.frame(scale(X_numerical))
  X_output <- cbind(X_categorical, X_standardized)
  output <- list(X = X_output)
}
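
To make the masked branch concrete, here is a minimal sketch that runs the same split on a toy data frame. The data, the column names, and the way `mask` is built are assumptions for illustration only; they are not taken from easyml.

```r
# Toy data (hypothetical): Race is an integer code, Age and Income are numeric.
X <- data.frame(
  Race   = c(1, 2, 3, 1, 6),
  Age    = c(23, 35, 41, 29, 52),
  Income = c(30, 45, 60, 38, 80)
)

# Hypothetical mask: TRUE marks the columns to treat as categorical.
mask <- names(X) %in% c("Race")

# Same steps as the masked branch quoted above.
X_categorical  <- X[, mask, drop = FALSE]
X_numerical    <- X[, !mask, drop = FALSE]
X_standardized <- data.frame(scale(X_numerical))
X_output       <- cbind(X_categorical, X_standardized)

print(X_output)
```

In this sketch the Race codes pass through untouched while Age and Income are centered and scaled. Unless something recodes the column later, a 6-level code like this would reach the model as a single numeric column.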

We ended up recoding Race into a binary variable (white and non-white), since only a small number of subjects fell into the non-white categories. However, we wanted to ask how we should handle categorical variables such as Race in the future.

Thank you so much for your help!

youngahn commented 5 years ago

Sorry for the delayed reply. Hope you are well, Iris!

In easyml, you can specify variables as (binary) categorical variables using the categorical_variables input argument. Check out this link for an example: https://ccs-lab.github.io/easyml/articles/titanic.html.

You could also code Race into multiple binary (dummy) variables (e.g., Asian = 1 or 0, White = 1 or 0, African_American = 1 or 0) and enter them as categorical variables in easyml.
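
One way to build such binary columns is with base R's `model.matrix`. This is only a sketch: the data frame, the level labels, and the order of the codes are made up, so substitute the actual coding of your Race variable.

```r
# Hypothetical data: Race stored as integer codes 1..6 (labels below are made up).
df <- data.frame(
  Race = c(1, 2, 3, 1, 6, 4, 5),
  Age  = c(23, 35, 41, 29, 52, 47, 31)
)

race_levels <- c("White", "Asian", "African_American",
                 "Hispanic", "Native_American", "Other")
df$Race <- factor(df$Race, levels = 1:6, labels = race_levels)

# One 0/1 column per level; "- 1" drops the intercept so every level gets a column.
dummies <- model.matrix(~ Race - 1, data = df)
colnames(dummies) <- sub("^Race", "", colnames(dummies))

# Replace the original Race column with the binary indicators.
df_coded <- cbind(df["Age"], as.data.frame(dummies))
print(df_coded)
```

The names of the new columns (here `race_levels`) are what you would then list in `categorical_variables`. With an unpenalized model, one level is usually dropped as the reference category so that the indicators are not perfectly collinear.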

Best, Young