explicit split settings

jreps commented 1 month ago

At the moment we can split the data into train/test and folds by patientId, rowId or time.

It would be nice to have an explicit splitter where you can provide the rowIds for the test set, the train set and the folds. That way you can ensure the same split even when the features (or other settings) change.
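As a rough sketch of where those rowIds could come from: assuming an earlier split is available as a data frame with `rowId` and `index` columns, where a negative index marks test rows and positive values are fold numbers, the pieces to reuse could be pulled out as below. The `previousSplit` data is made up for illustration.

```r
# Hypothetical earlier split: index -1 marks test rows, 1..3 are training folds
previousSplit <- data.frame(
  rowId = 1:8,
  index = c(-1, 1, 2, 3, -1, 1, 2, 3)
)

isTest <- previousSplit$index < 0

testRowIds  <- previousSplit$rowId[isTest]
trainRowIds <- previousSplit$rowId[!isTest]
# Fold of each training row, in the same order as trainRowIds
trainFolds  <- previousSplit$index[!isTest]
```

Keeping `trainRowIds` and `trainFolds` in the same order is what ties each training row to its fold.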

jreps commented 1 month ago

Here is code that seems to work for me:

createExplicitSplitSetting <- function(testRowIds, trainRowIds, trainFolds) {
  splitSettings <- list(
    testRowIds = testRowIds,
    trainRowIds = trainRowIds,
    trainFolds = trainFolds
  )

  attr(splitSettings, "fun") <- "explicitSplitter"
  class(splitSettings) <- "splitSettings"
  return(splitSettings)
}

explicitSplitter <- function(population, splitSettings) {
  testRowIds <- splitSettings$testRowIds
  trainRowIds <- splitSettings$trainRowIds
  trainFolds <- splitSettings$trainFolds

  split <- data.frame(
    rowId = c(testRowIds, trainRowIds),
    index = c(rep(-1, length(testRowIds)), trainFolds)
  )

  return(split)
}
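A quick way to check the behaviour is to run the two functions on toy ids. The sketch below repeats the definitions so that it runs on its own, and adds a length check that is not part of the posted code (an assumption about what a safer version might want). The `population` data frame and the ids are made up; the splitter as posted does not read `population` at all.

```r
createExplicitSplitSetting <- function(testRowIds, trainRowIds, trainFolds) {
  # Added check (not in the posted code): one fold label per training row
  stopifnot(length(trainRowIds) == length(trainFolds))
  splitSettings <- list(
    testRowIds = testRowIds,
    trainRowIds = trainRowIds,
    trainFolds = trainFolds
  )
  attr(splitSettings, "fun") <- "explicitSplitter"
  class(splitSettings) <- "splitSettings"
  splitSettings
}

explicitSplitter <- function(population, splitSettings) {
  data.frame(
    rowId = c(splitSettings$testRowIds, splitSettings$trainRowIds),
    index = c(rep(-1, length(splitSettings$testRowIds)), splitSettings$trainFolds)
  )
}

# Toy data: rows 1-2 are test, rows 3-6 are train in folds 1 and 2
population <- data.frame(rowId = 1:6)
settings <- createExplicitSplitSetting(
  testRowIds = c(1, 2),
  trainRowIds = c(3, 4, 5, 6),
  trainFolds = c(1, 2, 1, 2)
)
split <- explicitSplitter(population, settings)

# Running the splitter again with the same settings gives the same assignment
sameSplit <- identical(split, explicitSplitter(population, settings))
```

Because the fold of every training row is passed in through `trainFolds`, the result depends only on the settings and not on any random seed.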

egillax commented 1 month ago

Does the proposed code also allow controlling the training folds? For example, if you need to ensure exactly the same split not only into train/test but also into each fold within train.

egillax commented 3 days ago

I added a suggested feature for this in #504

OHDSI / PatientLevelPrediction

explicit split settings #487