NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

Compatibility of `generate_cv_index` outputs with `rsample` functions #277

Closed · sigmafelix closed this issue 5 months ago

sigmafelix commented 7 months ago

As we want to adopt the tidymodels interface for base learners, the spatiotemporal cross-validation indices generated by generate_cv_index need to be usable with rsample functions. A working example is below and will be available in my working branch soon:

#' Generate manual rset object from spatiotemporal cross-validation indices
#' @param cvindex integer. Output of [`generate_cv_index`].
#' @param data data.frame from [stdt][`convert_stobj_to_stdt`]. Should be the
#' same object as the one passed to the `covars` argument of
#' [`generate_cv_index`].
#' @param cv_mode character(1). Spatiotemporal cross-validation indexing method.
#' See `cv_mode` description in [`generate_cv_index`].
#' @returns rset object of `rsample` package. A tibble with a list column of
#' training-test data.frames and a column of labels.
#' @author Insang Song
#' @importFrom rsample make_splits
#' @importFrom rsample manual_rset
#' @export
convert_cv_index_rset <- function(cvindex, data, cv_mode) {
  # one split per fold index
  maxcvi <- max(cvindex)
  len_cvi <- seq_len(maxcvi)
  list_cvi <- split(len_cvi, len_cvi)
  # for each fold, rows outside the fold are used for analysis (training)
  # and rows inside the fold for assessment (testing)
  list_cvi_rows <-
    lapply(
      list_cvi,
      function(x) {
        list(analysis = which(cvindex != x),
             assessment = which(cvindex == x))
      }
    )
  # convert the row-index lists into rsplit objects
  list_split_dfs <-
    lapply(
      list_cvi_rows,
      function(x) {
        rsample::make_splits(x = x, data = data)
      }
    )
  # label each split and assemble the manual rset
  modename <- sprintf("cvfold_%s_%03d", cv_mode, len_cvi)
  rset_stcv <- rsample::manual_rset(list_split_dfs, modename)
  return(rset_stcv)
}
kyle-messier commented 7 months ago

@sigmafelix We don't have to adopt all of tidymodels, but we can where it makes sense. So this function helps rsample play well with the various S-T cross-validation methods?

sigmafelix commented 7 months ago

@Spatiotemporal-Exposures-and-Toxicology Yes, cross-validation in tidymodels operates on rsample outputs, and generate_cv_index gives us the cross-validation fold indices as an integer vector. The function above combines that integer vector with the original data.frame (whose number of rows equals the length of the vector) to build an rset-class object.

# provided that dfcovarst is a data.frame with PM2.5 and covariates along with the required lon, lat, and time fields:
dfcovarstdt <- convert_stobj_to_stdt(dfcovarst)
dfcovarstdt$stdt$time <- as.Date(dfcovarstdt$stdt$time)
dfcovars_lblto <-
  generate_cv_index(dfcovarstdt, "lblto", blocks = c(10, 10), t_fold = 60L)
dfcovarstdt_cv <-
  convert_cv_index_rset(dfcovars_lblto, dfcovarstdt$stdt, "lblto")

## tidymodels specification
xgb_mod <-
  parsnip::boost_tree(learn_rate = tune::tune()) |>
  parsnip::set_engine("xgboost", eval_metric = list("rmse", "mae")) |>
  parsnip::set_mode("regression")

## Bayesian hyperparameter tuning over the spatiotemporal folds
pm25mod <- workflows::workflow() |>
  workflows::add_model(xgb_mod) |>
  workflows::add_formula(pm2.5 ~ .) |>
  tune::tune_bayes(
    resamples = dfcovarstdt_cv,
    iter = 50,
    metrics = yardstick::metric_set(yardstick::rmse, yardstick::mae)
  )
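The fold-level performance can then be summarized from the tuning results; a minimal sketch, assuming the pm25mod object above:

## summarize RMSE and MAE across the spatiotemporal folds
tune::collect_metrics(pm25mod)

## best hyperparameter candidates by RMSE
tune::show_best(pm25mod, metric = "rmse")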
sigmafelix commented 5 months ago
kyle-messier commented 4 months ago

@sigmafelix @eva0marques @mitchellmanware @dzilber @dawranadeep

I think there is a lot of value in using R tidymodels for all of our base and meta learners. I think we should require that all of the models be based there so that we can keep things relatively simple. Unfortunately, @dzilber and @dawranadeep, that means we probably can't have a GP base learner, since I do not see that as an option.

Also, @sigmafelix @mitchellmanware - I suggest we keep things simple with the neural network and utilize the brulee package, even if it means only implementing a feed-forward network. A minimal sketch of such a specification is below.
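As a rough sketch rather than a final design, a brulee-backed feed-forward network can be specified through parsnip::mlp(); the tuned parameters here are placeholders:

## feed-forward neural network via the brulee (torch) engine
nnet_mod <-
  parsnip::mlp(
    hidden_units = tune::tune(),
    learn_rate = tune::tune(),
    epochs = 500L
  ) |>
  parsnip::set_engine("brulee") |>
  parsnip::set_mode("regression")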

With a tidymodels approach, I think we can implement these base learners with similar inputs and relatively simple code:

Next, the stacks package provides a meta-learner based on penalized regression that integrates with tidymodels; a sketch of how it could sit on top of our folds follows.
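As a rough illustration only: assuming the base learners are tuned on the same rset of spatiotemporal folds (e.g., dfcovarstdt_cv above) with predictions and workflows saved (stacks::control_stack_bayes() or similar), and xgb_res / nnet_res are those hypothetical tuning results, the stacking step could look like this:

## penalized-regression meta-learner over the base-learner candidates
pm25_stack <-
  stacks::stacks() |>
  stacks::add_candidates(xgb_res) |>
  stacks::add_candidates(nnet_res) |>
  stacks::blend_predictions() |>
  stacks::fit_members()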

@michael-conway If you have the bandwidth, it would be great to get your input on using the pins and vetiver packages for creating official versions and deploying our models to NIEHS Posit Connect. This would make the versioning in beethoven seamless if we can rely on well-developed and well-documented Posit packages. A sketch of that pattern is below.
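A minimal sketch of the pins + vetiver pattern, assuming a fitted workflow named pm25_fit and a configured Posit Connect account (names are placeholders):

## wrap the fitted model with the metadata needed for versioning/deployment
v <- vetiver::vetiver_model(pm25_fit, model_name = "beethoven-pm25")

## version the model on Posit Connect through a pins board
board <- pins::board_connect()
vetiver::vetiver_pin_write(board, v)

## deploy a plumber prediction API to the same Connect server
vetiver::vetiver_deploy_rsconnect(board, "user.name/beethoven-pm25")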