OHDSI / PatientLevelPrediction

An R package for performing patient level prediction in an observational database in the OMOP Common Data Model.
https://ohdsi.github.io/PatientLevelPrediction

New external validation function #431

Closed egillax closed 2 months ago

egillax commented 4 months ago

For discussion @jreps @lhjohn

I made a new function externalValidate to simplify external validation for our network study and in general in Strategus.

This function allows us with one call to externally validate multiple models on multiple prediction problems. It allows either to test all models on all prediction problems or match models to the target cohort they were developed on.

The output file structure will look something like this (for two prediction problems, with models from two databases validated on a third):

```
outputFolder
└── validationDatabaseName
    ├── developmentDatabaseName1
    │   ├── Analysis_1
    │   └── Analysis_2
    ├── developmentDatabaseName2
    │   ├── Analysis_1
    │   └── Analysis_2
    ├── plpData_Target_11931_Outcome_6243
    ├── plpData_Target_11932_Outcome_298
    └── validationLog.txt
```

Currently it's limited in that all models have to have used the same covariateSettings, since the data is extracted only once per prediction problem.

codecov[bot] commented 4 months ago

Codecov Report

Attention: Patch coverage is 0%, with 69 lines in your changes missing coverage. Please review.

Project coverage is 88.77%. Comparing base (d9be57a) to head (dc9dd31). Report is 5 commits behind head on develop.

:exclamation: Current head dc9dd31 differs from pull request most recent head d5ca7ee. Consider uploading reports for the commit d5ca7ee to get more accurate results

| Files | Patch % | Lines |
| --- | --- | --- |
| R/ExternalValidatePlp.R | 0.00% | 69 Missing :warning: |
Additional details and impacted files

```diff
@@             Coverage Diff             @@
##           develop     #431      +/-   ##
===========================================
- Coverage    89.36%   88.77%   -0.59%
===========================================
  Files           49       49
  Lines         9835     9907      +72
===========================================
+ Hits          8789     8795       +6
- Misses        1046     1112      +66
```

:umbrella: View full report in Codecov by Sentry.

lhjohn commented 3 months ago

This looks like a good solution. We can adjust it later for additional predictionProblem parameters. We should make it explicit in the documentation that the format of predictionProblem is targetId, outcomeId, populationSettings, which I think is currently only evident from the code.
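For illustration, a minimal sketch of what such a predictionProblem specification could look like (the cohort ids are taken from the example output above; the exact structure and the settings shown are assumptions, not the implemented format):

```r
# Hypothetical sketch only: one row per prediction problem, with
# populationSettings kept in a list column because it is a settings object.
predictionProblem <- data.frame(
  targetId = c(11931, 11932),
  outcomeId = c(6243, 298)
)
predictionProblem$populationSettings <- list(
  PatientLevelPrediction::createStudyPopulationSettings(riskWindowEnd = 365),
  PatientLevelPrediction::createStudyPopulationSettings(riskWindowEnd = 730)
)
```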

I believe a next step would be to automatically determine the predictionProblem dataframe from the Strategus output files (not necessary for our study). Maybe ModelTransferModule could do that and save it in an additional file.

jreps commented 3 months ago

I'm guessing you are using predictionProblem as the input, rather than T, O, and populationSettings as individual inputs, so that you can pass a list of predictionProblems if you want to test on different tasks? I wonder whether we want to have the model in the list as well: T/O/pop settings/model, so you can validate lots of models across different tasks (this would be the most flexible)?

Also, we could have something smart that finds all the models for the same T and does the union of the covariateSettings across these models to extract a single plpData object per T?
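As a rough sketch of that idea (assuming each loaded plpModel exposes its covariateSettings under modelDesign, which may not match the actual object structure):

```r
# Hedged sketch: for all models developed on the same target, collect the
# covariateSettings and keep the distinct ones, so plpData is extracted once.
uniqueCovariateSettings <- function(plpModelList) {
  allSettings <- lapply(plpModelList, function(model) model$modelDesign$covariateSettings)
  unique(allSettings)
}
```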

For the output structure - I like that the data are saved and that there is a validation log. We need to check the database upload functions, as I think they look for a certain structure (i.e., the validation results being inside the runPlp output location). We will want to make sure the save structure is compatible with the current db upload code, or that the db code can easily be modified for the new structure.

I was also thinking of modifying runPlp by adding a setting.csv with columns like 'analysisId', 'developmentDatabase', 'validationDatabase', 'modelDesignJson', 'plpDataLocation', 'modelLocation', 'checksum'. That file could be used as a look-up to determine details about each 'Analysis_X' result folder (and also for an incremental mode that skips models whose settings are unchanged, like cohort generator does), so all the results can live in the same result folder rather than relying on a folder structure that encodes the info. If you like this idea, we probably want to do the same for external validation?
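To make the look-up idea concrete, a hypothetical setting.csv could be written like this (all values are placeholders):

```r
# Hypothetical example of the proposed look-up file; every value is a placeholder.
lookup <- data.frame(
  analysisId = c("Analysis_1", "Analysis_2"),
  developmentDatabase = c("developmentDatabaseName1", "developmentDatabaseName2"),
  validationDatabase = c("validationDatabaseName", "validationDatabaseName"),
  modelDesignJson = c("modelDesign_1.json", "modelDesign_2.json"),
  plpDataLocation = c("plpData_Target_11931_Outcome_6243", "plpData_Target_11932_Outcome_298"),
  modelLocation = c("models/Analysis_1", "models/Analysis_2"),
  checksum = c("abc123", "def456")
)
write.csv(lookup, file.path("outputFolder", "setting.csv"), row.names = FALSE)
```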

egillax commented 2 months ago

Changes in the most recent revision:

- Instead of using the name "problem", we now use a validationDesign created with createValidationDesign.

The function signature is:

```r
createValidationDesign(
  targetId,
  outcomeId,
  populationSettings,
  restrictPlpDataSettings,
  plpModelList,
  recalibrate = NULL,
  runCovariateSummary = TRUE
)
```

plpModelList can include either objects loaded with loadPlpModel or paths to such objects. At the moment this only supports a single Target - Outcome pair per validationDesign, but it should ideally also support multiple Outcomes for one Target.
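For illustration, a minimal usage sketch (the cohort ids match the example output above, but the model paths and the population/restriction settings are placeholder assumptions):

```r
# Hedged usage sketch of createValidationDesign; paths are hypothetical.
design <- createValidationDesign(
  targetId = 11931,
  outcomeId = 6243,
  populationSettings = PatientLevelPrediction::createStudyPopulationSettings(),
  restrictPlpDataSettings = PatientLevelPrediction::createRestrictPlpDataSettings(),
  plpModelList = list(
    loadPlpModel("./models/Analysis_1/plpModel"),  # an already loaded plpModel
    "./models/Analysis_2/plpModel"                 # or a path to a saved plpModel
  )
)
```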

Then you validate with:

```r
validateExternal(
  validationDesignList,
  databaseDetails,
  logSettings,
  outputFolder
)
```

Ideally I think this function should replace validateMultiplePlp and/or externalValidateDbPlp. I'm open to suggestions for a good name for the new one; we could also just deprecate the old ones in case someone is relying on them for a study.
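For completeness, a hedged sketch of the full validation call (all connection and schema values are placeholder assumptions):

```r
# Hedged sketch; every connection/schema value below is a placeholder.
databaseDetails <- PatientLevelPrediction::createDatabaseDetails(
  connectionDetails = DatabaseConnector::createConnectionDetails(
    dbms = "postgresql",
    server = "myserver/mydb",
    user = "user",
    password = "secret"
  ),
  cdmDatabaseSchema = "cdm",
  cdmDatabaseName = "validationDatabase",
  cohortDatabaseSchema = "results",
  cohortTable = "cohort",
  outcomeDatabaseSchema = "results",
  outcomeTable = "cohort"
)

validateExternal(
  validationDesignList = list(design),
  databaseDetails = databaseDetails,
  logSettings = PatientLevelPrediction::createLogSettings(),
  outputFolder = "./validation"
)
```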

I've tested this with seven models validated on two databases, or actually validated on my dev database twice (it pretended to be a different one the second time). I've also tested the shinyApp and all seems to work as it should. The output structure now is (shown for a single model, two cohorts in two databases):

```
outputFolder
├── validationDatabase
│   ├── Analysis_1
│   ├── Analysis_2
│   ├── targetId_11931_L1
│   ├── targetId_11932_L1
│   └── validationLog.txt
├── validationDatabaseTwo
│   ├── Analysis_1
│   ├── Analysis_2
│   ├── targetId_11931_L1
│   ├── targetId_11932_L1
│   └── validationLog.txt
└── sqlite
    └── databaseFile.sqlite
```

The analysis numbers are sequential and do not correspond to the ones used during model development. A current limitation is that all models need to have the same covariateSettings; this is enforced, and the function will error before extracting the data if it is not the case. I'm thinking about how to fix that limitation in the most efficient way.

@jreps this is ready for second review.