Closed azimov closed 2 months ago
The Redshift tests seem to fail for reasons unrelated to this PR
Attention: 12 lines
in your changes are missing coverage. Please review.
Comparison is base (
78757f1
) 98.29% compared to head (e3c7bab
) 97.69%.
Files | Patch % | Lines |
---|---|---|
R/CohortSample.R | 91.42% | 12 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@azimov thanks for putting this together! Love the concept you've put forward here and just playing with this branch. Let me share the code I used (based on this branch) and wanted to confirm a few items about the behavior and then ask some questions before diving into the coding:
library(CohortGenerator)
#> Loading required package: DatabaseConnector
#> Loading required package: R6
# Create the sample cohort definition Set
cds <- CohortGenerator::createEmptyCohortDefinitionSet()
cohortJsonFiles <- list.files(path = system.file("testdata/name/cohorts", package = "CohortGenerator"), full.names = TRUE)
for (i in 1:length(cohortJsonFiles)) {
cohortJsonFileName <- cohortJsonFiles[i]
cohortName <- tools::file_path_sans_ext(basename(cohortJsonFileName))
cohortJson <- readChar(cohortJsonFileName, file.info(cohortJsonFileName)$size)
cohortExpression <- CirceR::cohortExpressionFromJson(cohortJson)
cohortSql <- CirceR::buildCohortQuery(cohortExpression, options = CirceR::createGenerateOptions(generateStats = FALSE))
cds <- rbind(cds, data.frame(cohortId = i,
cohortName = cohortName,
sql = cohortSql,
stringsAsFactors = FALSE))
}
# Create the cohorts on Eunomia
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
conn <- DatabaseConnector::connect(connectionDetails = connectionDetails)
#> Connecting using SQLite driver
cohortTableNames <- getCohortTableNames(cohortTable = "cohort")
recordKeepingFolder <- file.path(getwd(), "extras/RecordKeepingSamples")
# Show the cohort table names
print(str(cohortTableNames))
#> List of 7
#> $ cohortTable : chr "cohort"
#> $ cohortSampleTable : chr "cohort"
#> $ cohortInclusionTable : chr "cohort_inclusion"
#> $ cohortInclusionResultTable: chr "cohort_inclusion_result"
#> $ cohortInclusionStatsTable : chr "cohort_inclusion_stats"
#> $ cohortSummaryStatsTable : chr "cohort_summary_stats"
#> $ cohortCensorStatsTable : chr "cohort_censor_stats"
#> NULL
createCohortTables(
connectionDetails = connectionDetails,
cohortDatabaseSchema = "main",
cohortTableNames = cohortTableNames
)
#> Connecting using SQLite driver
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger. (This message will not be shown again this R session)
#> Creating cohort tables
#> - Created table main.cohort
#> - Created table main.cohort
#> - Created table main.cohort_inclusion
#> - Created table main.cohort_inclusion_result
#> - Created table main.cohort_inclusion_stats
#> - Created table main.cohort_summary_stats
#> - Created table main.cohort_censor_stats
#> Creating cohort tables took 0.42secs
generateCohortSet(
cohortDefinitionSet = cds,
connection = conn,
cdmDatabaseSchema = "main",
cohortDatabaseSchema = "main",
cohortTableNames = cohortTableNames,
incremental = TRUE,
incrementalFolder = recordKeepingFolder
)
#> 1/3- Generating cohort: celecoxib
#> | | | 0% | |=== | 4% | |====== | 8% | |======== | 12% | |=========== | 16% | |============== | 20% | |================= | 24% | |==================== | 28% | |====================== | 32% | |========================= | 36% | |============================ | 40% | |=============================== | 44% | |================================== | 48% | |==================================== | 52% | |======================================= | 56% | |========================================== | 60% | |============================================= | 64% | |================================================ | 68% | |================================================== | 72% | |===================================================== | 76% | |======================================================== | 80% | |=========================================================== | 84% | |============================================================== | 88% | |================================================================ | 92% | |=================================================================== | 96% | |======================================================================| 100%
#> Executing SQL took 0.0745 secs
#> 2/3- Generating cohort: celecoxibAge40
#> | | | 0% | |== | 3% | |===== | 7% | |======= | 10% | |========= | 13% | |============ | 17% | |============== | 20% | |================ | 23% | |=================== | 27% | |===================== | 30% | |======================= | 33% | |========================== | 37% | |============================ | 40% | |============================== | 43% | |================================= | 47% | |=================================== | 50% | |===================================== | 53% | |======================================== | 57% | |========================================== | 60% | |============================================ | 63% | |=============================================== | 67% | |================================================= | 70% | |=================================================== | 73% | |====================================================== | 77% | |======================================================== | 80% | |========================================================== | 83% | |============================================================= | 87% | |=============================================================== | 90% | |================================================================= | 93% | |==================================================================== | 97% | |======================================================================| 100%
#> Executing SQL took 0.121 secs
#> 3/3- Generating cohort: celecoxibAge40Male
#> | | | 0% | |== | 3% | |==== | 6% | |====== | 9% | |======== | 12% | |========== | 15% | |============ | 18% | |============== | 21% | |================ | 24% | |=================== | 26% | |===================== | 29% | |======================= | 32% | |========================= | 35% | |=========================== | 38% | |============================= | 41% | |=============================== | 44% | |================================= | 47% | |=================================== | 50% | |===================================== | 53% | |======================================= | 56% | |========================================= | 59% | |=========================================== | 62% | |============================================= | 65% | |=============================================== | 68% | |================================================= | 71% | |=================================================== | 74% | |====================================================== | 76% | |======================================================== | 79% | |========================================================== | 82% | |============================================================ | 85% | |============================================================== | 88% | |================================================================ | 91% | |================================================================== | 94% | |==================================================================== | 97% | |======================================================================| 100%
#> Executing SQL took 0.19 secs
#> Generating cohort set took 0.92 secs
# Print the counts
cohortCounts <- getCohortCounts(
connection = conn,
cohortDatabaseSchema = "main",
cohortTable = cohortTableNames$cohortTable
)
#> Counting cohorts took 0.0617 secs
print(cohortCounts)
#> cohortId cohortEntries cohortSubjects
#> 1 1 1800 1800
#> 2 2 569 569
#> 3 3 266 266
# Perform the cohort sampling and print results
sampledCohortDefinitionSet <- sampleCohortDefinitionSet(
cohortDefinitionSet = cds,
connection = conn,
n = 1000,
seed = 64374, # OHDSI
cohortDatabaseSchema = "main",
cohortTableNames = cohortTableNames,
incremental = TRUE,
incrementalFolder = recordKeepingFolder
)
#> Inserting data took 0.0223 secs
#> | | | 0% | |================== | 25% | |=================================== | 50% | |==================================================== | 75% | |======================================================================| 100%
#> Executing SQL took 0.0238 secs
#> Inserting data took 0.0537 secs
#> | | | 0% | |================== | 25% | |=================================== | 50% | |==================================================== | 75% | |======================================================================| 100%
#> Executing SQL took 0.0257 secs
#> Inserting data took 0.0112 secs
#> | | | 0% | |================== | 25% | |=================================== | 50% | |==================================================== | 75% | |======================================================================| 100%
#> Executing SQL took 0.016 secs
#> Generating sample set took 0.59 secs
print(sampledCohortDefinitionSet[, colnames(sampledCohortDefinitionSet)[colnames(sampledCohortDefinitionSet) != 'sql']])
#> cohortId cohortName isSample status
#> 1 65374 celecoxib [SAMPLE seed=64374 n=1000] TRUE generated
#> 2 66374 celecoxibAge40 [SAMPLE seed=64374 n=1000] TRUE generated
#> 3 67374 celecoxibAge40Male [SAMPLE seed=64374 n=1000] TRUE generated
#> sampleTargetCohortId
#> 1 1
#> 2 2
#> 3 3
cohortCounts <- getCohortCounts(
connection = conn,
cohortDatabaseSchema = "main",
cohortTable = cohortTableNames$cohortTable
)
#> Counting cohorts took 0.046 secs
print(cohortCounts)
#> cohortId cohortEntries cohortSubjects
#> 1 1 1800 1800
#> 2 2 569 569
#> 3 3 266 266
#> 4 65374 1000 1000
#> 5 66374 569 569
#> 6 67374 266 266
# Cleanup
DatabaseConnector::disconnect(conn)
unlink(recordKeepingFolder, recursive = TRUE, force = TRUE)
Created on 2023-11-20 with reprex v2.0.2
From above:
# Show the cohort table names
print(str(cohortTableNames))
#> List of 7
#> $ cohortTable : chr "cohort"
#> $ cohortSampleTable : chr "cohort"
#> $ cohortInclusionTable : chr "cohort_inclusion"
#> $ cohortInclusionResultTable: chr "cohort_inclusion_result"
#> $ cohortInclusionStatsTable : chr "cohort_inclusion_stats"
#> $ cohortSummaryStatsTable : chr "cohort_summary_stats"
#> $ cohortCensorStatsTable : chr "cohort_censor_stats"
#> NULL
What is the purpose of the cohortSampleTable
property added to the cohort table names? Its currently defaulted to the same value as the cohortTable so just wondering if this was added with intention of using it to store samples separately from the main cohorts?
Looking at the cohort counts after sampling:
print(sampledCohortDefinitionSet[, colnames(sampledCohortDefinitionSet)[colnames(sampledCohortDefinitionSet) != 'sql']])
#> cohortId cohortName isSample status
#> 1 65374 celecoxib [SAMPLE seed=64374 n=1000] TRUE generated
#> 2 66374 celecoxibAge40 [SAMPLE seed=64374 n=1000] TRUE generated
#> 3 67374 celecoxibAge40Male [SAMPLE seed=64374 n=1000] TRUE generated
#> sampleTargetCohortId
#> 1 1
#> 2 2
#> 3 3
cohortCounts <- getCohortCounts(
connection = conn,
cohortDatabaseSchema = "main",
cohortTable = cohortTableNames$cohortTable
)
#> Counting cohorts took 0.046 secs
print(cohortCounts)
#> cohortId cohortEntries cohortSubjects
#> 1 1 1800 1800
#> 2 2 569 569
#> 3 3 266 266
#> 4 65374 1000 1000
#> 5 66374 569 569
#> 6 67374 266 266
Noting that the sampledCohortDefinitionSet
provides the mapping between the "base" cohort (1-3 in this example) and the sampled cohorts. And they are both stored in the same cohort table.
A few questions here:
cohortSampleTable
so that we can retain the original cohort IDs? Just thinking that if part of the purpose here is to run the various HADES analytical packages (with or without Strategus), wouldn't it be easier to write the code using the same cohort IDs as intended for the analysis vs. trying to change IDs based on using a sample of patients? I feel like nearly all of the HADES packages allow for the specification of the cohortTable
when calling the various functions that require access to the patient level data. @sena in response to your questions
What is the purpose of the cohortSampleTable property added to the cohort table names? Its currently defaulted to the same value as the cohortTable so just wondering if this was added with intention of using it to store samples separately from the main cohorts?
Would we be better served storing the sampled cohort(s) in the cohortSampleTable so that we can retain the original cohort IDs? ...
This could be supported but my intention with the ID mapping was to allow users to sample multiple times (this can be achieved because the seed is treated as a vector). The use of a cohort sample table is optional for similar reasons. I have no strong preference on what behaviour should be default here and it wouldn't be much work to change it we thing this is better.
Should we consider a % for sampling? ...
I think this is a good idea and could potentially be the default behaviour also allowing for a sampleN
value in cases where a fixed number is preferred.
@anthonysena this PR has now completed everything we discussed. Mac unit tests failed because of a random rJava error (i assume a package update in CRAN) not because of anything in this branch.
Brief description of design goals implemented here:
Future goal: