Is bootstrap / subsampling split stored anywhere?

cjvanlissa commented 2 years ago

I see the bootstrapped datasets, but I do not see which rows of the original data they contain. It would be useful to have this information, for example to obtain out-of-bag (OOB) predictions of the observed scores for each case across all trees.

brandmaier commented 2 years ago

Indeed, this is an important piece of information, which is why it is preserved in SEM forest results. SEM forests store the resampled datasets in the field forest.data. This is a list with one entry per tree. Each entry is in turn a named list with the entries bootstrap.data, holding the resampled data (historically, these were bootstrap samples), and oob.data, holding the out-of-bag samples. In these data frames, the row names store the row indices of the original data. We could wrap this in accessor functions to make it easier to use:

getOOBRowIndices <- function(forest, i) {
  as.integer(rownames(forest$forest.data[[i]]$oob.data))
}

and

getResampledRowIndices <- function(forest, i) {
  as.integer(rownames(forest$forest.data[[i]]$bootstrap.data))
}

What do you think?
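
The row-name mechanism described above can be checked on a toy data frame without fitting a forest. The sketch below uses only base R and hand-builds a list shaped like the structure described in this comment (forest.data, bootstrap.data, oob.data). The object `toy_forest` and its data are made up for illustration and are not produced by semtree, so treat this as a sanity check of the idea rather than of the package itself.

```r
# Toy data standing in for the original data set (hypothetical).
original <- data.frame(x = c(10, 20, 30, 40, 50, 60))

# Subsample without replacement; subsetting keeps the original row names.
set.seed(1)
in_bag <- sort(sample(nrow(original), size = 4))

# Hand-built stand-in for a forest with a single tree (assumed layout).
toy_forest <- list(forest.data = list(list(
  bootstrap.data = original[in_bag, , drop = FALSE],
  oob.data       = original[-in_bag, , drop = FALSE]
)))

# The two accessors proposed in this thread.
getOOBRowIndices <- function(forest, i) {
  as.integer(rownames(forest$forest.data[[i]]$oob.data))
}
getResampledRowIndices <- function(forest, i) {
  as.integer(rownames(forest$forest.data[[i]]$bootstrap.data))
}

resampled <- getResampledRowIndices(toy_forest, 1)
oob <- getOOBRowIndices(toy_forest, 1)

# The recovered indices match the sampled rows, and the two index sets
# together cover every row of the original data exactly once.
stopifnot(identical(resampled, in_bag))
stopifnot(identical(sort(c(resampled, oob)), seq_len(nrow(original))))
```

When rows are drawn with replacement, base R makes repeated row names unique by appending suffixes such as .1, so it is worth checking that the conversion to integer still yields the intended index in that case.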

cjvanlissa commented 2 years ago

Thank you, Andreas! That addresses my question. I don't think user-facing functions are a priority for this, but if we make those, I think they should return an n (cases) x t (trees) logical matrix for Boolean indexing; that's faster than indexing by position.
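
A minimal sketch of what such a matrix-returning helper could look like, assuming the forest.data layout described earlier in this thread. The name `getOOBMatrix` and the toy object `toy_forest` are hypothetical and not part of semtree; the number of cases `n` is passed in explicitly rather than read from the forest object.

```r
# Hypothetical helper: returns an n (cases) x t (trees) logical matrix
# that is TRUE where a case is out-of-bag for a given tree.
getOOBMatrix <- function(forest, n) {
  vapply(
    forest$forest.data,
    function(entry) seq_len(n) %in% as.integer(rownames(entry$oob.data)),
    logical(n)
  )
}

# Toy stand-in for a forest with three trees (made-up data, assumed layout).
set.seed(42)
original <- data.frame(x = rnorm(8))
n <- nrow(original)
toy_forest <- list(forest.data = lapply(1:3, function(i) {
  in_bag <- sample(n, size = 5)  # subsample 5 of 8 rows without replacement
  list(
    bootstrap.data = original[in_bag, , drop = FALSE],
    oob.data       = original[-in_bag, , drop = FALSE]
  )
}))

oob <- getOOBMatrix(toy_forest, n)
dim(oob)      # 8 rows (cases), 3 columns (trees)
colSums(oob)  # 3 out-of-bag cases per tree, since 8 - 5 = 3

# Boolean indexing: the out-of-bag cases for tree 2.
oob_tree2 <- original[oob[, 2], , drop = FALSE]
```

With this layout, `rowSums(oob)` gives the number of trees for which each case is out-of-bag, which is the count needed when averaging OOB predictions per case.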

brandmaier / semtree, issue #41