carpentries-incubator / r-ml-tabular-data

A Data-Carpentry-style lesson on some ML techniques in R
https://carpentries-incubator.github.io/r-ml-tabular-data/

_episodes_rmd/06-Exploration.Rmd: Major code edit #29

Closed: gmcdonald-sfg closed this issue 2 years ago

gmcdonald-sfg commented 2 years ago

In the “Repeat Cross Validation in a Loop” section, I would suggest using lapply or purrr::map instead of for loops. Much of the R community has been moving away from for loops toward list-based iteration functions like lapply and purrr::map; here’s a good source for reasons why: https://www.earthdatascience.org/courses/earth-analytics/automate-science-workflows/use-apply-functions-for-efficient-code-r/. I think this is particularly important for cross-validation in ML because lapply (and purrr::map) have drop-in parallel equivalents, while for loops do not; this is exactly what the furrr package is great for. CV is a good example of an “embarrassingly parallel” problem that can and should be done in parallel when possible. You could simply rewrite your code as follows:

# Assumes dtrain and paramList are defined earlier in the episode
library(xgboost)
library(dplyr)   # for %>% and bind_rows()
library(purrr)   # for map_dfr()

xgb.cv_wrapper <- function(params){
  # Run cross-validation for one set of parameters
  rwCV <- xgb.cv(params = params,
                 data = dtrain,
                 nrounds = 500,
                 nfold = 10,
                 early_stopping_rounds = 10,
                 verbose = FALSE)
  # Return the evaluation log row for the best iteration
  rwCV$evaluation_log[rwCV$best_iteration]
}

set.seed(708)
# lapply option
bestResults <- lapply(paramList,
                      FUN = xgb.cv_wrapper) %>%
  bind_rows()

set.seed(708)
# purrr::map alternative
bestResults <- map_dfr(paramList,
                      .f = xgb.cv_wrapper)

Or, if you want progress bars:

install.packages("progress")
library(progress)

xgb.cv_wrapper <- function(params){
  rwCV <- xgb.cv(params = params,
                 data = dtrain,
                 nrounds = 500,
                 nfold = 10,
                 early_stopping_rounds = 10,
                 verbose = FALSE)
  pb$tick()  # advance the progress bar (pb must exist before the function is called)
  # Return the evaluation log row for the best iteration
  rwCV$evaluation_log[rwCV$best_iteration]
}

set.seed(708)
# lapply option
pb <- progress_bar$new(total = length(paramList))
bestResults <- lapply(paramList,
                      FUN = xgb.cv_wrapper) %>%
  bind_rows()

set.seed(708)
# purrr::map alternative
pb <- progress_bar$new(total = length(paramList))
bestResults <- map_dfr(paramList,
                       .f = xgb.cv_wrapper)
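
And if you did want the fully parallel version, here is a minimal furrr sketch, assuming the same xgb.cv_wrapper, paramList, and dtrain as above. Reproducible parallel RNG goes through furrr_options(seed = ...) rather than a single set.seed() call, and because an xgb.DMatrix is an external pointer it does not serialize to multisession workers, so in practice the wrapper would need to rebuild dtrain from the raw training data (or plan(multicore) could be used on Linux/macOS):

library(furrr)

# Start background R sessions for the workers
plan(multisession, workers = 4)

# furrr alternative: each parameter set runs on its own worker.
# Caveat: dtrain (an xgb.DMatrix) does not serialize to multisession
# workers, so the wrapper would have to rebuild it from the raw data.
bestResults <- future_map_dfr(paramList,
                              .f = xgb.cv_wrapper,
                              .options = furrr_options(seed = 708))

plan(sequential)  # shut the workers down when finished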
djhunter commented 2 years ago

No, I think that adds too much complexity to the lesson. The loop just runs over a few values, so I don't think parallelization is worth it. The xgb.cv function (on my machine) takes advantage of multiple cores already, so it appears that the part that benefits most from parallelization is already parallelized.
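
For context, the multithreading referred to above is xgboost's own: the general nthread parameter (which defaults to all available cores when unset) controls how many cores a single xgb.cv call uses. A small illustrative sketch, with made-up parameter values rather than the lesson's settings:

library(xgboost)

# nthread is a general xgboost parameter; a single xgb.cv call will
# spread the boosting work across this many cores.
params <- list(objective = "reg:squarederror",  # illustrative objective
               max_depth = 4,                   # illustrative value
               eta = 0.1,                       # illustrative value
               nthread = 4)

rwCV <- xgb.cv(params = params,
               data = dtrain,
               nrounds = 500,
               nfold = 10,
               early_stopping_rounds = 10,
               verbose = FALSE)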