Closed cjvanlissa closed 2 years ago
Indeed, this is an important piece of information, which is why it is preserved in SEM forest results. SEM forests store the permuted datasets in the field forest.data. This is a list with one entry per tree. Each entry is in turn a named list with the entries bootstrap.data for the resampled data (historically, these were bootstraps) and oob.data for the out-of-bag samples. In these data frames, the row names store the row indices of the original data. We could wrap this in functions for easier access, like this:
# Row indices (in the original data) of the out-of-bag cases of tree i
getOOBRowIndices <- function(forest, i) {
  as.integer(rownames(forest$forest.data[[i]]$oob.data))
}
and
# Row indices (in the original data) of the resampled cases of tree i
getResampledRowIndices <- function(forest, i) {
  as.integer(rownames(forest$forest.data[[i]]$bootstrap.data))
}
What do you think?
Thank you Andreas! That addresses my question. I don't think user-facing functions are a priority for this, but if we make them, I think they should return an n (cases) x t (trees) logical matrix for Boolean indexing; that's faster than indexing by position.
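A minimal sketch of what such a matrix-returning helper might look like, assuming the forest.data structure described above. The name getOOBMatrix and the argument n (the number of rows in the original data) are hypothetical and not part of the package. The mock object only mimics the fields mentioned in this thread:

```r
# Hypothetical helper (not part of the package): builds an n x t logical
# matrix whose entry [i, j] is TRUE when case i is out-of-bag for tree j.
getOOBMatrix <- function(forest, n) {
  vapply(
    forest$forest.data,
    function(tree_data) {
      oob <- logical(n)  # all FALSE to start with
      oob[as.integer(rownames(tree_data$oob.data))] <- TRUE
      oob
    },
    logical(n)
  )
}

# Mock object that mimics only the structure described in this thread.
original <- data.frame(x = c(10, 20, 30, 40, 50))
resample <- function(in_bag) {
  out_of_bag <- setdiff(seq_len(nrow(original)), in_bag)
  list(
    bootstrap.data = original[in_bag, , drop = FALSE],
    oob.data = original[out_of_bag, , drop = FALSE]
  )
}
mock_forest <- list(
  forest.data = list(resample(c(1, 2, 3)), resample(c(2, 4, 5)))
)

oob <- getOOBMatrix(mock_forest, n = nrow(original))
```

Column j can then be used directly for Boolean indexing, for example original[oob[, j], ] for the out-of-bag cases of tree j.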
I see the bootstrapped datasets, but I do not see which rows of the original data these are. It is desirable to have this information, for example to obtain oob predictions of observed scores per case across all trees.
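For illustration, one way such per-case out-of-bag predictions could be aggregated once the row indices are known. This is a sketch under assumptions: pred is a hypothetical n x t numeric matrix of per-tree predictions, oob is an n x t logical matrix marking out-of-bag cases, and averaging across trees is only one possible aggregation. Neither oobMeanPrediction nor these inputs come from the package:

```r
# Hypothetical helper: per-case mean of the predictions from those trees
# for which the case was out-of-bag.
oobMeanPrediction <- function(pred, oob) {
  stopifnot(identical(dim(pred), dim(oob)))
  pred[!oob] <- NA                   # discard in-bag predictions
  out <- rowMeans(pred, na.rm = TRUE)
  out[is.nan(out)] <- NA             # cases that were never out-of-bag
  out
}

# Toy inputs: 3 cases, 2 trees.
pred <- matrix(c(1, 2, 3,
                 4, 5, 6), nrow = 3)
oob <- matrix(c(TRUE, FALSE, FALSE,
                TRUE, TRUE,  FALSE), nrow = 3)

res <- oobMeanPrediction(pred, oob)
```

Case 1 is out-of-bag for both trees, case 2 only for the second tree, and case 3 for neither, so it receives NA.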