OHDSI / PatientLevelPrediction

An R package for performing patient level prediction in an observational database in the OMOP Common Data Model.
https://ohdsi.github.io/PatientLevelPrediction

External validation error on a single database with multiple cohort ids as subgroup evaluations. #320

Open nbehzad opened 1 year ago

nbehzad commented 1 year ago

Describe the bug
This may not be considered a bug, since this kind of functionality may not have been anticipated when the function was designed.

A single model can be validated on multiple databases using the externalValidationDbPlp function. However, when I want to validate a single model using multiple validationDatabaseDetails that share the same database name but have different target ids (representing particular subsets of the target cohort), the results overwrite each other. This is because the location of an external validation result is determined solely by the database name and the analysis id. One fix would be to include the target id in the database name when saving the results.
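As a concrete illustration (the schema, cohort ids, and outcome id below are hypothetical), the collision arises because both designs carry the same cdmDatabaseName, which is all the saving logic uses:

```r
# Two validation designs against the SAME database but different target
# cohorts (e.g. the full cohort and a subgroup). All ids are made up.
val1 <- PatientLevelPrediction::createDatabaseDetails(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "cdm",
  cdmDatabaseName = "myDatabase",
  targetId = 1001,   # full target cohort
  outcomeIds = 2001
)
val2 <- PatientLevelPrediction::createDatabaseDetails(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "cdm",
  cdmDatabaseName = "myDatabase",  # same name ...
  targetId = 1002,                 # ... different subgroup cohort
  outcomeIds = 2001
)
# Both validations are saved under
#   <outputFolder>/myDatabase/<analysisId>/validationResult
# so the second run silently overwrites the first.
```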

jreps commented 1 year ago

I think, currently the validation loads every model and then saves the results to the Validation/data/developmentModelAnalysisName folder. The assumption was external validation uses the same settings as the development data, but in the recent PLP this was made more flexible. But, as you point out, the saving was not made flexible.

When PLP was originally written, it used the folder names in the shiny app to link the model development and validation, but now it is more advanced and compares the modelDesigns and other settings to know whether development and validation results are linked. This is a long way of saying: we no longer care about the result folder name, so we are not restricted to using the developmentModelAnalysisNames.

So, I think a solution could be:

When somebody runs a validation, if runInfo.csv exists, a new row is added (with the resultId incremented by 1) containing the validation settings; otherwise runInfo.csv is created with a first row. This will track everything that is validated. The results will be saved in a folder named either 'Analysis_resultId' or 'Result_resultId'.

This would also let people validate multiple times against the same database, but perhaps that is good, as the database version may change over time.

Alternatively, we could skip separating results by database and instead record the database in an additional column of runInfo.csv; then everything can live directly in /Validation.
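A rough sketch of the bookkeeping described above (the file name runInfo.csv comes from the proposal; the helper name and exact column set are assumptions for illustration):

```r
# Hypothetical helper: append one row per validation run to
# <validationFolder>/runInfo.csv, creating the file on first use,
# and return the resultId under which results should be saved.
logValidationRun <- function(validationFolder, databaseName, analysisId) {
  runInfoFile <- file.path(validationFolder, "runInfo.csv")
  if (file.exists(runInfoFile)) {
    runInfo <- utils::read.csv(runInfoFile)
    resultId <- max(runInfo$resultId) + 1  # +1 per new validation
  } else {
    runInfo <- NULL
    resultId <- 1
  }
  newRow <- data.frame(
    resultId = resultId,
    databaseName = databaseName,  # the optional extra column mentioned above
    analysisId = analysisId,
    validationDate = as.character(Sys.Date())
  )
  utils::write.csv(rbind(runInfo, newRow), runInfoFile, row.names = FALSE)
  # results would then go under file.path(validationFolder,
  # paste0("Result_", resultId))
  return(resultId)
}
```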

What do you think?

nbehzad commented 1 year ago

Your solution is excellent, but it would necessitate a major change to this function and may introduce new problems elsewhere, such as in displaying results in the Shiny app. Splitting the validation results by database folder complicates things, because each database folder will contain a separate result folder for each model (the analysisId in the plpModel). For three models, we would have three folders in each database directory: Analysis_1, Analysis_2, and Analysis_3.

I believe the current pipeline is excellent, and we can keep it while making a few minor modifications that resolve the issue without affecting the rest of the package. My recommendation is therefore to maintain the current pipeline and save the results using the following strategy:

In the createDatabaseDetails function, we can add a new attribute named "databaseDetailsName":

  attr(result, "databaseDetailsName") <- databaseDetailsName

createDatabaseDetails function with the proposed change:

function (connectionDetails, cdmDatabaseSchema, cdmDatabaseName, 
  tempEmulationSchema = cdmDatabaseSchema, cohortDatabaseSchema = cdmDatabaseSchema, 
  cohortTable = "cohort", outcomeDatabaseSchema = cdmDatabaseSchema, 
  outcomeTable = "cohort", targetId = NULL, outcomeIds = NULL, 
  cdmVersion = 5, cohortId = NULL, 
  databaseDetailsName = cdmDatabaseName)  # proposed new argument
{
  if (is.null(targetId)) {
    if (!is.null(cohortId)) {
      ParallelLogger::logWarn("cohortId has been deprecated. Please use targetId.")
      targetId <- cohortId
    }
  }
  result <- list(connectionDetails = connectionDetails, cdmDatabaseSchema = cdmDatabaseSchema, 
    cdmDatabaseName = cdmDatabaseName, tempEmulationSchema = tempEmulationSchema, 
    cohortDatabaseSchema = cohortDatabaseSchema, cohortTable = cohortTable, 
    outcomeDatabaseSchema = outcomeDatabaseSchema, outcomeTable = outcomeTable, 
    targetId = targetId, outcomeIds = outcomeIds, cdmVersion = cdmVersion)
  attr(result, "cdmDatabaseName") <- cdmDatabaseName
  attr(result, "databaseDetailsName") <- databaseDetailsName  # proposed new attribute
  class(result) <- "databaseDetails"
  return(result)
}

Its default value is a copy of cdmDatabaseName. In this scenario, everything will operate as usual. This value can be used as a folder name during validation. Consequently, anyone can assign any meaningful name to validation databases and create multiple databaseDetails with the same database and different names based on the database version or target id, or anything else. In the externalValidationDbPlp function, we only need to replace cdmDatabaseName with databaseDetailsName in a few lines as shown below:
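Under this proposal, a user could then write something like the following (the databaseDetailsName argument is the proposed addition; the ids and names are hypothetical):

```r
# Same database, two differently named validation designs: the proposed
# databaseDetailsName keeps their output folders separate.
val1 <- createDatabaseDetails(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "cdm",
  cdmDatabaseName = "myDatabase",
  databaseDetailsName = "myDatabase_fullCohort",  # proposed new argument
  targetId = 1001,
  outcomeIds = 2001
)
val2 <- createDatabaseDetails(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "cdm",
  cdmDatabaseName = "myDatabase",
  databaseDetailsName = "myDatabase_elderlySubgroup",
  targetId = 1002,
  outcomeIds = 2001
)
# Results would now land in separate folders:
#   <outputFolder>/myDatabase_fullCohort/<analysisId>/validationResult
#   <outputFolder>/myDatabase_elderlySubgroup/<analysisId>/validationResult
```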

externalValidationDbPlp function:

  checkIsClass(settings, "validationSettings")
  result <- list()
  length(result) <- length(validationDatabaseDetails)
  names(result) <- unlist(lapply(validationDatabaseDetails, 
    function(x) attr(x, "databaseDetailsName")))
  for (databaseDetails in validationDatabaseDetails) {
    databaseName <- attr(databaseDetails, "cdmDatabaseName")
    databaseDetailsName <- attr(databaseDetails, "databaseDetailsName")
    ...
    result[[databaseDetailsName]] <- tryCatch({
      externalValidatePlp(plpModel, plpData, databaseName = databaseName, 
        population, settings = settings)
    }, error = function(e) {
      ParallelLogger::logInfo(e)
      return(NULL)
    })
    ...
    savePlpResult(result[[databaseDetailsName]], dirPath = file.path(outputFolder, 
      databaseDetailsName, plpModel$trainDetails$analysisId, 
      "validationResult"))
  }

I also agree with adding a csv file to the /Validation directory (not to the per-database Validation subdirectories) that summarizes the validations across the different databaseDetails settings.

As a result, for each databaseDetailsName we will have a folder containing the validation results of all trained models, and all validation results will be displayed in the Shiny app without any modification.

jreps commented 1 year ago

I like it! Much easier and fixes the issue.

Do you want to make the edits and make a pull request or shall I make the edits?

nbehzad commented 1 year ago

I would appreciate it if you could make the changes because they may affect parts of the package about which I am unaware, or some additional changes may be required to comply with other functions. As a result, your edits will be more reliable than mine :)

nbehzad commented 1 year ago

I just resolved the issue in my local package. I also noticed and fixed another bug. It was about the validationDatabaseDetails class being checked within the validateMultiplePlp function. I will submit a pull request, and you can look into my changes.