Random sample functionality

azimov commented 7 months ago

Brief description of design goals implemented here:

Allow the generation of samples of a given cohort set (either in the original cohort table or a new cohort_sample table)
Sample all records at the subject level in a cohort, not the events
Allow configurable pseudo-random seeds
Use R for a consistent source of randomness that can be used in a cross platform manner
produce an object that is functionally identical to an original cohort definition set but only contains the sampled cohorts to allow use in any analytics pipeline, either to test studies quickly or allow sampling

Future goal:

Allow usage in strategus study specifications

azimov commented 7 months ago

The Redshift tests seem to fail for reasons unrelated to this PR

codecov[bot] commented 7 months ago

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (78757f1) 98.29% compared to head (e3c7bab) 97.69%.

Files	Patch %	Lines
R/CohortSample.R	91.42%	12 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## develop #129 +/- ## =========================================== - Coverage 98.29% 97.69% -0.60% =========================================== Files 12 13 +1 Lines 1407 1562 +155 =========================================== + Hits 1383 1526 +143 - Misses 24 36 +12 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

anthonysena commented 7 months ago

@azimov thanks for putting this together! Love the concept you've put forward here and just playing with this branch. Let me share the code I used (based on this branch) and wanted to confirm a few items about the behavior and then ask some questions before diving into the coding:

library(CohortGenerator)
#> Loading required package: DatabaseConnector
#> Loading required package: R6

# Create the sample cohort definition Set
cds <- CohortGenerator::createEmptyCohortDefinitionSet()
cohortJsonFiles <- list.files(path = system.file("testdata/name/cohorts", package = "CohortGenerator"), full.names = TRUE)
for (i in 1:length(cohortJsonFiles)) {
  cohortJsonFileName <- cohortJsonFiles[i]
  cohortName <- tools::file_path_sans_ext(basename(cohortJsonFileName))
  cohortJson <- readChar(cohortJsonFileName, file.info(cohortJsonFileName)$size)
  cohortExpression <- CirceR::cohortExpressionFromJson(cohortJson)
  cohortSql <- CirceR::buildCohortQuery(cohortExpression, options = CirceR::createGenerateOptions(generateStats = FALSE))
  cds <- rbind(cds, data.frame(cohortId = i,
                               cohortName = cohortName, 
                               sql = cohortSql,
                               stringsAsFactors = FALSE))
}

# Create the cohorts on Eunomia
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
conn <- DatabaseConnector::connect(connectionDetails = connectionDetails)
#> Connecting using SQLite driver
cohortTableNames <- getCohortTableNames(cohortTable = "cohort")
recordKeepingFolder <- file.path(getwd(), "extras/RecordKeepingSamples")

# Show the cohort table names
print(str(cohortTableNames))
#> List of 7
#>  $ cohortTable               : chr "cohort"
#>  $ cohortSampleTable         : chr "cohort"
#>  $ cohortInclusionTable      : chr "cohort_inclusion"
#>  $ cohortInclusionResultTable: chr "cohort_inclusion_result"
#>  $ cohortInclusionStatsTable : chr "cohort_inclusion_stats"
#>  $ cohortSummaryStatsTable   : chr "cohort_summary_stats"
#>  $ cohortCensorStatsTable    : chr "cohort_censor_stats"
#> NULL

createCohortTables(
  connectionDetails = connectionDetails,
  cohortDatabaseSchema = "main",
  cohortTableNames = cohortTableNames
)
#> Connecting using SQLite driver
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger. (This message will not be shown again this R session)
#> Creating cohort tables
#> - Created table main.cohort
#> - Created table main.cohort
#> - Created table main.cohort_inclusion
#> - Created table main.cohort_inclusion_result
#> - Created table main.cohort_inclusion_stats
#> - Created table main.cohort_summary_stats
#> - Created table main.cohort_censor_stats
#> Creating cohort tables took 0.42secs

generateCohortSet(
  cohortDefinitionSet = cds,
  connection = conn,
  cdmDatabaseSchema = "main",
  cohortDatabaseSchema = "main",
  cohortTableNames = cohortTableNames,
  incremental = TRUE,
  incrementalFolder = recordKeepingFolder
)
#> 1/3- Generating cohort: celecoxib
#>   |                                                                              |                                                                      |   0%  |                                                                              |===                                                                   |   4%  |                                                                              |======                                                                |   8%  |                                                                              |========                                                              |  12%  |                                                                              |===========                                                           |  16%  |                                                                              |==============                                                        |  20%  |                                                                              |=================                                                     |  24%  |                                                                              |====================                                                  |  28%  |                                                                              |======================                                                |  32%  |                                                                              |=========================                                             |  36%  |                                                                              |============================                                          |  40%  |                                                                              |===============================                                       |  44%  |                                                                              |==================================                                    |  48%  |                                                                              |====================================                                  |  52%  |                                                                              |=======================================                               |  56%  |                                                                              |==========================================                            |  60%  |                                                                              |=============================================                         |  64%  |                                                                              |================================================                      |  68%  |                                                                              |==================================================                    |  72%  |                                                                              |=====================================================                 |  76%  |                                                                              |========================================================              |  80%  |                                                                              |===========================================================           |  84%  |                                                                              |==============================================================        |  88%  |                                                                              |================================================================      |  92%  |                                                                              |===================================================================   |  96%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0745 secs
#> 2/3- Generating cohort: celecoxibAge40
#>   |                                                                              |                                                                      |   0%  |                                                                              |==                                                                    |   3%  |                                                                              |=====                                                                 |   7%  |                                                                              |=======                                                               |  10%  |                                                                              |=========                                                             |  13%  |                                                                              |============                                                          |  17%  |                                                                              |==============                                                        |  20%  |                                                                              |================                                                      |  23%  |                                                                              |===================                                                   |  27%  |                                                                              |=====================                                                 |  30%  |                                                                              |=======================                                               |  33%  |                                                                              |==========================                                            |  37%  |                                                                              |============================                                          |  40%  |                                                                              |==============================                                        |  43%  |                                                                              |=================================                                     |  47%  |                                                                              |===================================                                   |  50%  |                                                                              |=====================================                                 |  53%  |                                                                              |========================================                              |  57%  |                                                                              |==========================================                            |  60%  |                                                                              |============================================                          |  63%  |                                                                              |===============================================                       |  67%  |                                                                              |=================================================                     |  70%  |                                                                              |===================================================                   |  73%  |                                                                              |======================================================                |  77%  |                                                                              |========================================================              |  80%  |                                                                              |==========================================================            |  83%  |                                                                              |=============================================================         |  87%  |                                                                              |===============================================================       |  90%  |                                                                              |=================================================================     |  93%  |                                                                              |====================================================================  |  97%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.121 secs
#> 3/3- Generating cohort: celecoxibAge40Male
#>   |                                                                              |                                                                      |   0%  |                                                                              |==                                                                    |   3%  |                                                                              |====                                                                  |   6%  |                                                                              |======                                                                |   9%  |                                                                              |========                                                              |  12%  |                                                                              |==========                                                            |  15%  |                                                                              |============                                                          |  18%  |                                                                              |==============                                                        |  21%  |                                                                              |================                                                      |  24%  |                                                                              |===================                                                   |  26%  |                                                                              |=====================                                                 |  29%  |                                                                              |=======================                                               |  32%  |                                                                              |=========================                                             |  35%  |                                                                              |===========================                                           |  38%  |                                                                              |=============================                                         |  41%  |                                                                              |===============================                                       |  44%  |                                                                              |=================================                                     |  47%  |                                                                              |===================================                                   |  50%  |                                                                              |=====================================                                 |  53%  |                                                                              |=======================================                               |  56%  |                                                                              |=========================================                             |  59%  |                                                                              |===========================================                           |  62%  |                                                                              |=============================================                         |  65%  |                                                                              |===============================================                       |  68%  |                                                                              |=================================================                     |  71%  |                                                                              |===================================================                   |  74%  |                                                                              |======================================================                |  76%  |                                                                              |========================================================              |  79%  |                                                                              |==========================================================            |  82%  |                                                                              |============================================================          |  85%  |                                                                              |==============================================================        |  88%  |                                                                              |================================================================      |  91%  |                                                                              |==================================================================    |  94%  |                                                                              |====================================================================  |  97%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.19 secs
#> Generating cohort set took 0.92 secs

# Print the counts
cohortCounts <- getCohortCounts(
  connection = conn,
  cohortDatabaseSchema = "main",
  cohortTable = cohortTableNames$cohortTable
)
#> Counting cohorts took 0.0617 secs
print(cohortCounts)
#>   cohortId cohortEntries cohortSubjects
#> 1        1          1800           1800
#> 2        2           569            569
#> 3        3           266            266

# Perform the cohort sampling and print results
sampledCohortDefinitionSet <- sampleCohortDefinitionSet(
  cohortDefinitionSet = cds,
  connection = conn,
  n = 1000,
  seed = 64374, # OHDSI
  cohortDatabaseSchema = "main",
  cohortTableNames = cohortTableNames,
  incremental = TRUE,
  incrementalFolder = recordKeepingFolder
)
#> Inserting data took 0.0223 secs
#>   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0238 secs
#> Inserting data took 0.0537 secs
#>   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0257 secs
#> Inserting data took 0.0112 secs
#>   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.016 secs
#> Generating sample set took 0.59 secs
print(sampledCohortDefinitionSet[, colnames(sampledCohortDefinitionSet)[colnames(sampledCohortDefinitionSet) != 'sql']])
#>   cohortId                                    cohortName isSample    status
#> 1    65374          celecoxib [SAMPLE seed=64374 n=1000]     TRUE generated
#> 2    66374     celecoxibAge40 [SAMPLE seed=64374 n=1000]     TRUE generated
#> 3    67374 celecoxibAge40Male [SAMPLE seed=64374 n=1000]     TRUE generated
#>   sampleTargetCohortId
#> 1                    1
#> 2                    2
#> 3                    3

cohortCounts <- getCohortCounts(
  connection = conn,
  cohortDatabaseSchema = "main",
  cohortTable = cohortTableNames$cohortTable
)
#> Counting cohorts took 0.046 secs
print(cohortCounts)
#>   cohortId cohortEntries cohortSubjects
#> 1        1          1800           1800
#> 2        2           569            569
#> 3        3           266            266
#> 4    65374          1000           1000
#> 5    66374           569            569
#> 6    67374           266            266

# Cleanup
DatabaseConnector::disconnect(conn)
unlink(recordKeepingFolder, recursive = TRUE, force = TRUE)

^{Created on 2023-11-20 with reprex v2.0.2}

From above:

# Show the cohort table names
print(str(cohortTableNames))
#> List of 7
#>  $ cohortTable               : chr "cohort"
#>  $ cohortSampleTable         : chr "cohort"
#>  $ cohortInclusionTable      : chr "cohort_inclusion"
#>  $ cohortInclusionResultTable: chr "cohort_inclusion_result"
#>  $ cohortInclusionStatsTable : chr "cohort_inclusion_stats"
#>  $ cohortSummaryStatsTable   : chr "cohort_summary_stats"
#>  $ cohortCensorStatsTable    : chr "cohort_censor_stats"
#> NULL

What is the purpose of the cohortSampleTable property added to the cohort table names? Its currently defaulted to the same value as the cohortTable so just wondering if this was added with intention of using it to store samples separately from the main cohorts?

Looking at the cohort counts after sampling:

print(sampledCohortDefinitionSet[, colnames(sampledCohortDefinitionSet)[colnames(sampledCohortDefinitionSet) != 'sql']])
#>   cohortId                                    cohortName isSample    status
#> 1    65374          celecoxib [SAMPLE seed=64374 n=1000]     TRUE generated
#> 2    66374     celecoxibAge40 [SAMPLE seed=64374 n=1000]     TRUE generated
#> 3    67374 celecoxibAge40Male [SAMPLE seed=64374 n=1000]     TRUE generated
#>   sampleTargetCohortId
#> 1                    1
#> 2                    2
#> 3                    3

cohortCounts <- getCohortCounts(
  connection = conn,
  cohortDatabaseSchema = "main",
  cohortTable = cohortTableNames$cohortTable
)
#> Counting cohorts took 0.046 secs
print(cohortCounts)
#>   cohortId cohortEntries cohortSubjects
#> 1        1          1800           1800
#> 2        2           569            569
#> 3        3           266            266
#> 4    65374          1000           1000
#> 5    66374           569            569
#> 6    67374           266            266

Noting that the sampledCohortDefinitionSet provides the mapping between the "base" cohort (1-3 in this example) and the sampled cohorts. And they are both stored in the same cohort table.

A few questions here:

Would we be better served storing the sampled cohort(s) in the cohortSampleTable so that we can retain the original cohort IDs? Just thinking that if part of the purpose here is to run the various HADES analytical packages (with or without Strategus), wouldn't it be easier to write the code using the same cohort IDs as intended for the analysis vs. trying to change IDs based on using a sample of patients? I feel like nearly all of the HADES packages allow for the specification of the cohortTable when calling the various functions that require access to the patient level data.
Should we consider a % for sampling? Just a thought as perhaps it is easier to uniformly apply a % sample and then support a minimum N? Not sure of the implications on your current approach but just putting that out there for consideration.

azimov commented 7 months ago

@sena in response to your questions

What is the purpose of the cohortSampleTable property added to the cohort table names? Its currently defaulted to the same value as the cohortTable so just wondering if this was added with intention of using it to store samples separately from the main cohorts?

Would we be better served storing the sampled cohort(s) in the cohortSampleTable so that we can retain the original cohort IDs? ...

This could be supported but my intention with the ID mapping was to allow users to sample multiple times (this can be achieved because the seed is treated as a vector). The use of a cohort sample table is optional for similar reasons. I have no strong preference on what behaviour should be default here and it wouldn't be much work to change it we thing this is better.

Should we consider a % for sampling? ...

I think this is a good idea and could potentially be the default behaviour also allowing for a sampleN value in cases where a fixed number is preferred.

azimov commented 5 months ago

@anthonysena this PR has now completed everything we discussed. Mac unit tests failed because of a random rJava error (i assume a package update in CRAN) not because of anything in this branch.

OHDSI / CohortGenerator

Random sample functionality #129

Codecov Report