Hi. I'm sharing a little function that approximately preserves class balance within each training and validation set, even when ids are non-unique, as long as the minority class is rare (adapted from CVFolds). -- Susan

stratifyCVFoldsById <- function (V, Y, id = NULL) {

# 1. distribute the ids that have Y = 1 in any of the rows equally among all the folds,

# 2. separately, distribute the ids that have Y = 0 for all rows equally among the folds
if (is.null(id)) id <- seq_along(Y)
# sum of Y within each id: one entry per unique id, sorted by id
case_status_by_id <- by(Y, id, sum)
case_ids <- names(case_status_by_id)[case_status_by_id > 0]
noncase_ids <- names(case_status_by_id)[case_status_by_id == 0]
if (V > min(length(case_ids), length(noncase_ids))) {
    stop("number of ids in the minority class is less than the number of folds")
}
# shuffle each group of ids, then deal them out to the V folds in turn
valSet.case_ids <- split(sample(case_ids), rep(1:V, length.out = length(case_ids)))
valSet.noncase_ids <- split(sample(noncase_ids), rep(1:V, length.out = length(noncase_ids)))
validRows <- vector("list", length = V)
names(validRows) <- paste(seq(V))
for (v in seq(V)) {
    # rows whose id was assigned to validation fold v
    validRows[[v]] <- which(as.character(id) %in% c(valSet.case_ids[[v]], valSet.noncase_ids[[v]]))
}
return(validRows)

}
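
For anyone who wants to sanity-check the idea, here is a minimal, self-contained sketch on toy data. It repeats the two steps inline with base R rather than calling the function, so it runs on its own. The toy data (40 ids with 3 rows each, 8 of them cases) and the names `case_id_pool`, `fold_of_case`, `fold_of_noncase`, and `fold_by_row` are assumptions made up for this illustration, not part of the function above or of SuperLearner.

```r
set.seed(1)

# Hypothetical toy data: 40 ids, 3 rows per id, 8 ids with one Y = 1 row
n_id <- 40
id <- rep(seq_len(n_id), each = 3)
Y <- rep(0, length(id))
case_id_pool <- sample(seq_len(n_id), 8)
Y[match(case_id_pool, id)] <- 1   # first row of each chosen id becomes a case

V <- 4

# Steps 1 and 2: split case ids and non-case ids across the folds separately
any_case <- tapply(Y, id, sum) > 0
case_ids <- names(any_case)[any_case]
noncase_ids <- names(any_case)[!any_case]
fold_of_case <- split(sample(case_ids),
                      rep(seq_len(V), length.out = length(case_ids)))
fold_of_noncase <- split(sample(noncase_ids),
                         rep(seq_len(V), length.out = length(noncase_ids)))

# Validation rows for fold v: every row whose id landed in fold v
validRows <- lapply(seq_len(V), function(v) {
  which(as.character(id) %in% c(fold_of_case[[v]], fold_of_noncase[[v]]))
})

# Check 1: every row is used exactly once across the validation sets
all_rows <- sort(unlist(validRows))

# Check 2: all rows of an id sit in a single fold
fold_by_row <- rep(seq_len(V), lengths(validRows))[order(unlist(validRows))]
ids_per_fold <- tapply(fold_by_row, id, function(f) length(unique(f)))

# Check 3: number of distinct case ids in each validation set
case_ids_per_fold <- sapply(validRows, function(r) {
  length(unique(id[r][Y[r] == 1]))
})
print(case_ids_per_fold)
```

With 8 case ids and V = 4, each validation set should receive two case ids, and no id should be split between folds. If the function above is loaded, the same three checks can be applied to the list returned by `stratifyCVFoldsById(V, Y, id)`.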

ecpolley / SuperLearner

stratifyCV by iid #139
