OHDSI / CohortGenerator

An R package for instantiating cohorts using data in the CDM.
https://ohdsi.github.io/CohortGenerator/
11 stars 10 forks source link

Random sample functionality #129

Closed azimov closed 2 months ago

azimov commented 7 months ago

Brief description of design goals implemented here:

Future goal:

azimov commented 7 months ago

The Redshift tests seem to fail for reasons unrelated to this PR

codecov[bot] commented 7 months ago

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (78757f1) 98.29% compared to head (e3c7bab) 97.69%.

Files Patch % Lines
R/CohortSample.R 91.42% 12 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## develop #129 +/- ## =========================================== - Coverage 98.29% 97.69% -0.60% =========================================== Files 12 13 +1 Lines 1407 1562 +155 =========================================== + Hits 1383 1526 +143 - Misses 24 36 +12 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

anthonysena commented 7 months ago

@azimov thanks for putting this together! Love the concept you've put forward here and just playing with this branch. Let me share the code I used (based on this branch) and wanted to confirm a few items about the behavior and then ask some questions before diving into the coding:

library(CohortGenerator)
#> Loading required package: DatabaseConnector
#> Loading required package: R6

# Create the sample cohort definition Set
cds <- CohortGenerator::createEmptyCohortDefinitionSet()
cohortJsonFiles <- list.files(path = system.file("testdata/name/cohorts", package = "CohortGenerator"), full.names = TRUE)
for (i in 1:length(cohortJsonFiles)) {
  cohortJsonFileName <- cohortJsonFiles[i]
  cohortName <- tools::file_path_sans_ext(basename(cohortJsonFileName))
  cohortJson <- readChar(cohortJsonFileName, file.info(cohortJsonFileName)$size)
  cohortExpression <- CirceR::cohortExpressionFromJson(cohortJson)
  cohortSql <- CirceR::buildCohortQuery(cohortExpression, options = CirceR::createGenerateOptions(generateStats = FALSE))
  cds <- rbind(cds, data.frame(cohortId = i,
                               cohortName = cohortName, 
                               sql = cohortSql,
                               stringsAsFactors = FALSE))
}

# Create the cohorts on Eunomia
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
conn <- DatabaseConnector::connect(connectionDetails = connectionDetails)
#> Connecting using SQLite driver
cohortTableNames <- getCohortTableNames(cohortTable = "cohort")
recordKeepingFolder <- file.path(getwd(), "extras/RecordKeepingSamples")

# Show the cohort table names
print(str(cohortTableNames))
#> List of 7
#>  $ cohortTable               : chr "cohort"
#>  $ cohortSampleTable         : chr "cohort"
#>  $ cohortInclusionTable      : chr "cohort_inclusion"
#>  $ cohortInclusionResultTable: chr "cohort_inclusion_result"
#>  $ cohortInclusionStatsTable : chr "cohort_inclusion_stats"
#>  $ cohortSummaryStatsTable   : chr "cohort_summary_stats"
#>  $ cohortCensorStatsTable    : chr "cohort_censor_stats"
#> NULL

createCohortTables(
  connectionDetails = connectionDetails,
  cohortDatabaseSchema = "main",
  cohortTableNames = cohortTableNames
)
#> Connecting using SQLite driver
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger. (This message will not be shown again this R session)
#> Creating cohort tables
#> - Created table main.cohort
#> - Created table main.cohort
#> - Created table main.cohort_inclusion
#> - Created table main.cohort_inclusion_result
#> - Created table main.cohort_inclusion_stats
#> - Created table main.cohort_summary_stats
#> - Created table main.cohort_censor_stats
#> Creating cohort tables took 0.42secs

generateCohortSet(
  cohortDefinitionSet = cds,
  connection = conn,
  cdmDatabaseSchema = "main",
  cohortDatabaseSchema = "main",
  cohortTableNames = cohortTableNames,
  incremental = TRUE,
  incrementalFolder = recordKeepingFolder
)
#> 1/3- Generating cohort: celecoxib
#>   |                                                                              |                                                                      |   0%  |                                                                              |===                                                                   |   4%  |                                                                              |======                                                                |   8%  |                                                                              |========                                                              |  12%  |                                                                              |===========                                                           |  16%  |                                                                              |==============                                                        |  20%  |                                                                              |=================                                                     |  24%  |                                                                              |====================                                                  |  28%  |                                                                              |======================                                                |  32%  |                                                                              |=========================                                             |  36%  |                                                                              |============================                                          |  40%  |                                                                              |===============================                                       |  44%  |                                                                              |==================================                                    |  48%  |                                                                              |====================================                                  |  52%  |                                                                              |=======================================                               |  56%  |                                                                              |==========================================                            |  60%  |                                                                              |=============================================                         |  64%  |                                                                              |================================================                      |  68%  |                                                                              |==================================================                    |  72%  |                                                                              |=====================================================                 |  76%  |                                                                              |========================================================              |  80%  |                                                                              |===========================================================           |  84%  |                                                                              |==============================================================        |  88%  |                                                                              |================================================================      |  92%  |                                                                              |===================================================================   |  96%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0745 secs
#> 2/3- Generating cohort: celecoxibAge40
#>   |                                                                              |                                                                      |   0%  |                                                                              |==                                                                    |   3%  |                                                                              |=====                                                                 |   7%  |                                                                              |=======                                                               |  10%  |                                                                              |=========                                                             |  13%  |                                                                              |============                                                          |  17%  |                                                                              |==============                                                        |  20%  |                                                                              |================                                                      |  23%  |                                                                              |===================                                                   |  27%  |                                                                              |=====================                                                 |  30%  |                                                                              |=======================                                               |  33%  |                                                                              |==========================                                            |  37%  |                                                                              |============================                                          |  40%  |                                                                              |==============================                                        |  43%  |                                                                              |=================================                                     |  47%  |                                                                              |===================================                                   |  50%  |                                                                              |=====================================                                 |  53%  |                                                                              |========================================                              |  57%  |                                                                              |==========================================                            |  60%  |                                                                              |============================================                          |  63%  |                                                                              |===============================================                       |  67%  |                                                                              |=================================================                     |  70%  |                                                                              |===================================================                   |  73%  |                                                                              |======================================================                |  77%  |                                                                              |========================================================              |  80%  |                                                                              |==========================================================            |  83%  |                                                                              |=============================================================         |  87%  |                                                                              |===============================================================       |  90%  |                                                                              |=================================================================     |  93%  |                                                                              |====================================================================  |  97%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.121 secs
#> 3/3- Generating cohort: celecoxibAge40Male
#>   |                                                                              |                                                                      |   0%  |                                                                              |==                                                                    |   3%  |                                                                              |====                                                                  |   6%  |                                                                              |======                                                                |   9%  |                                                                              |========                                                              |  12%  |                                                                              |==========                                                            |  15%  |                                                                              |============                                                          |  18%  |                                                                              |==============                                                        |  21%  |                                                                              |================                                                      |  24%  |                                                                              |===================                                                   |  26%  |                                                                              |=====================                                                 |  29%  |                                                                              |=======================                                               |  32%  |                                                                              |=========================                                             |  35%  |                                                                              |===========================                                           |  38%  |                                                                              |=============================                                         |  41%  |                                                                              |===============================                                       |  44%  |                                                                              |=================================                                     |  47%  |                                                                              |===================================                                   |  50%  |                                                                              |=====================================                                 |  53%  |                                                                              |=======================================                               |  56%  |                                                                              |=========================================                             |  59%  |                                                                              |===========================================                           |  62%  |                                                                              |=============================================                         |  65%  |                                                                              |===============================================                       |  68%  |                                                                              |=================================================                     |  71%  |                                                                              |===================================================                   |  74%  |                                                                              |======================================================                |  76%  |                                                                              |========================================================              |  79%  |                                                                              |==========================================================            |  82%  |                                                                              |============================================================          |  85%  |                                                                              |==============================================================        |  88%  |                                                                              |================================================================      |  91%  |                                                                              |==================================================================    |  94%  |                                                                              |====================================================================  |  97%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.19 secs
#> Generating cohort set took 0.92 secs

# Print the counts
cohortCounts <- getCohortCounts(
  connection = conn,
  cohortDatabaseSchema = "main",
  cohortTable = cohortTableNames$cohortTable
)
#> Counting cohorts took 0.0617 secs
print(cohortCounts)
#>   cohortId cohortEntries cohortSubjects
#> 1        1          1800           1800
#> 2        2           569            569
#> 3        3           266            266

# Perform the cohort sampling and print results
sampledCohortDefinitionSet <- sampleCohortDefinitionSet(
  cohortDefinitionSet = cds,
  connection = conn,
  n = 1000,
  seed = 64374, # OHDSI
  cohortDatabaseSchema = "main",
  cohortTableNames = cohortTableNames,
  incremental = TRUE,
  incrementalFolder = recordKeepingFolder
)
#> Inserting data took 0.0223 secs
#>   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0238 secs
#> Inserting data took 0.0537 secs
#>   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0257 secs
#> Inserting data took 0.0112 secs
#>   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.016 secs
#> Generating sample set took 0.59 secs
print(sampledCohortDefinitionSet[, colnames(sampledCohortDefinitionSet)[colnames(sampledCohortDefinitionSet) != 'sql']])
#>   cohortId                                    cohortName isSample    status
#> 1    65374          celecoxib [SAMPLE seed=64374 n=1000]     TRUE generated
#> 2    66374     celecoxibAge40 [SAMPLE seed=64374 n=1000]     TRUE generated
#> 3    67374 celecoxibAge40Male [SAMPLE seed=64374 n=1000]     TRUE generated
#>   sampleTargetCohortId
#> 1                    1
#> 2                    2
#> 3                    3

cohortCounts <- getCohortCounts(
  connection = conn,
  cohortDatabaseSchema = "main",
  cohortTable = cohortTableNames$cohortTable
)
#> Counting cohorts took 0.046 secs
print(cohortCounts)
#>   cohortId cohortEntries cohortSubjects
#> 1        1          1800           1800
#> 2        2           569            569
#> 3        3           266            266
#> 4    65374          1000           1000
#> 5    66374           569            569
#> 6    67374           266            266

# Cleanup
DatabaseConnector::disconnect(conn)
unlink(recordKeepingFolder, recursive = TRUE, force = TRUE)

Created on 2023-11-20 with reprex v2.0.2

From above:

# Show the cohort table names
print(str(cohortTableNames))
#> List of 7
#>  $ cohortTable               : chr "cohort"
#>  $ cohortSampleTable         : chr "cohort"
#>  $ cohortInclusionTable      : chr "cohort_inclusion"
#>  $ cohortInclusionResultTable: chr "cohort_inclusion_result"
#>  $ cohortInclusionStatsTable : chr "cohort_inclusion_stats"
#>  $ cohortSummaryStatsTable   : chr "cohort_summary_stats"
#>  $ cohortCensorStatsTable    : chr "cohort_censor_stats"
#> NULL

What is the purpose of the cohortSampleTable property added to the cohort table names? Its currently defaulted to the same value as the cohortTable so just wondering if this was added with intention of using it to store samples separately from the main cohorts?

Looking at the cohort counts after sampling:

print(sampledCohortDefinitionSet[, colnames(sampledCohortDefinitionSet)[colnames(sampledCohortDefinitionSet) != 'sql']])
#>   cohortId                                    cohortName isSample    status
#> 1    65374          celecoxib [SAMPLE seed=64374 n=1000]     TRUE generated
#> 2    66374     celecoxibAge40 [SAMPLE seed=64374 n=1000]     TRUE generated
#> 3    67374 celecoxibAge40Male [SAMPLE seed=64374 n=1000]     TRUE generated
#>   sampleTargetCohortId
#> 1                    1
#> 2                    2
#> 3                    3

cohortCounts <- getCohortCounts(
  connection = conn,
  cohortDatabaseSchema = "main",
  cohortTable = cohortTableNames$cohortTable
)
#> Counting cohorts took 0.046 secs
print(cohortCounts)
#>   cohortId cohortEntries cohortSubjects
#> 1        1          1800           1800
#> 2        2           569            569
#> 3        3           266            266
#> 4    65374          1000           1000
#> 5    66374           569            569
#> 6    67374           266            266

Noting that the sampledCohortDefinitionSet provides the mapping between the "base" cohort (1-3 in this example) and the sampled cohorts. And they are both stored in the same cohort table.

A few questions here:

azimov commented 7 months ago

@sena in response to your questions

What is the purpose of the cohortSampleTable property added to the cohort table names? Its currently defaulted to the same value as the cohortTable so just wondering if this was added with intention of using it to store samples separately from the main cohorts?

Would we be better served storing the sampled cohort(s) in the cohortSampleTable so that we can retain the original cohort IDs? ...

This could be supported but my intention with the ID mapping was to allow users to sample multiple times (this can be achieved because the seed is treated as a vector). The use of a cohort sample table is optional for similar reasons. I have no strong preference on what behaviour should be default here and it wouldn't be much work to change it we thing this is better.

Should we consider a % for sampling? ...

I think this is a good idea and could potentially be the default behaviour also allowing for a sampleN value in cases where a fixed number is preferred.

azimov commented 5 months ago

@anthonysena this PR has now completed everything we discussed. Mac unit tests failed because of a random rJava error (i assume a package update in CRAN) not because of anything in this branch.