ModelOriented / forester

Trees are all you need
https://modeloriented.github.io/forester/
GNU General Public License v3.0

Data loss prevention in make_catboost.R after BayesianOptimization #37

Closed · lhthien09 closed this issue 2 years ago

lhthien09 commented 2 years ago

@Szmajasz Hi Szymon, lines 66 to 126 of `make_catboost.R` read:

```r
# Creating validation set in ratio 4:1
splited_data <- split_data(data, target, type)
data <- splited_data[[1]]
data_val <- splited_data[[2]]

# Creating pool objects for catboost 
categorical <- which(sapply(data, is.factor))
cat_data <- catboost::catboost.load_pool(data[, -which(names(data) == target), drop = FALSE],
                                         data[, target], cat_features = categorical)

cat_data_val <- catboost::catboost.load_pool(data_val[, -which(names(data_val) == target), drop = FALSE],
                                             data_val[, target], cat_features = categorical)

### Preparing tuning function 
catboost_tune_fun <- function(iterations, depth, learning_rate, random_strength, bagging_temperature, border_count, l2_leaf_reg){
  # Model for evaluating hyperparameters
  catboost_tune <- catboost::catboost.train(cat_data,
                                            params = list(verbose = 0,
                                                          iterations = iterations,
                                                          depth = depth,
                                                          learning_rate = learning_rate,
                                                          random_strength = random_strength,
                                                          bagging_temperature = bagging_temperature,
                                                          border_count = border_count,
                                                          l2_leaf_reg = l2_leaf_reg))

  # Evaluating model
  predicted <- catboost::catboost.predict(catboost_tune, cat_data_val)
  if (type == "classification"){
    predicted <- ifelse(predicted >= 0.5, 1, 0)
  }
  score <- desc * calculate_metric(tune_metric, predicted, data_val[[target]])

  list(Score = score, Pred = predicted)
}

### Tuning process
message("--- Starting tuning process")
tuned_catboost <- rBayesianOptimization::BayesianOptimization(catboost_tune_fun,
                                       bounds = list(iterations = c(10L, 1000L),
                                                     depth = c(1L, 8L),
                                                     learning_rate = c(0.01, 1.0),
                                                     random_strength = c(1e-9, 10),
                                                     bagging_temperature = c(0.0, 1.0),
                                                     border_count = c(1L, 255L),
                                                     l2_leaf_reg = c(2L, 30L)),
                                       init_grid_dt = NULL,
                                       init_points = 10,
                                       n_iter = tune_iter,
                                       acq = "ucb",
                                       kappa = 2.576,
                                       eps = 0.0,
                                       verbose = TRUE)

# Best hyperparameters
catboost_params <- append(tuned_catboost$Best_Par, list(verbose = 0))

# Creating final model 
cat_model <- catboost::catboost.train(cat_data, params = catboost_params)

}
```

We use Bayesian optimization to find the best tuple of hyperparameters. However, I noticed that the data is split into `data` and `data_val`, and after the optimal hyperparameters are found, the final model is still trained only on `data`. To prevent data loss, we should retrain the model on the whole original training set. I thought of combining the two structures `cat_data` and `cat_data_val` from your code, but I am not sure whether that would be fine. It would be much better to combine `cat_data` and `cat_data_val` instead of creating a new variable.
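As far as I know, the catboost R API does not expose a helper for concatenating two existing pool objects, so one workaround is to row-bind the two data-frame partitions back together and build a single pool from the result. A minimal sketch (the names `data_full` and `X_full` are mine; everything else comes from the snippet above):

```r
# Sketch only, not the package's actual code: rebuild the full training
# frame from the two partitions returned by split_data(), then load it
# into a single catboost pool for the final fit.
data_full <- rbind(data, data_val)

# Predictors without the target column, mirroring the snippet above
X_full <- data_full[, -which(names(data_full) == target), drop = FALSE]

cat_data_full <- catboost::catboost.load_pool(
  X_full,
  data_full[, target],
  cat_features = which(sapply(X_full, is.factor))
)
```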

lhthien09 commented 2 years ago

I meant that in the final step, `# Creating final model`, we should train on the whole original training set for the sake of data loss prevention.
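Concretely, with the `cat_data_full` pool sketched above, only the final fit would change:

```r
# Sketch: train the final model on the whole training set, using the
# hyperparameters found by BayesianOptimization on the 4:1 split.
catboost_params <- append(tuned_catboost$Best_Par, list(verbose = 0))
cat_model <- catboost::catboost.train(cat_data_full, params = catboost_params)
```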

kozaka93 commented 2 years ago

Thank you. We have made major changes to the forester package. The previous version of the package is available on the old branch. It will no longer be supported; we encourage you to use the new one.