OHDSI / CohortGenerator

An R package for instantiating cohorts using data in the CDM.
https://ohdsi.github.io/CohortGenerator/

getCohortCounts does not return counts for cohorts with 0 subjects #82

Closed ablack3 closed 1 year ago

ablack3 commented 1 year ago

getCohortCounts

Computes the subject and entry count per cohort. Note the cohortDefinitionSet parameter is optional - if you specify the cohortDefinitionSet, the cohort counts will be joined to the cohortDefinitionSet to include attributes like the cohortName.

It returns "A data frame with cohort counts"

But if a cohort's count is 0, the cohort is left out of the result rather than reported with a count of 0. As a result, getCohortCounts makes no distinction between cohorts that are empty and cohorts that were never generated.
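
One plausible explanation for the behavior described above, assuming the counts come from a grouped aggregate over the cohort table (an assumption; the package's SQL is not shown in this issue), is that a cohort with no rows forms no group and so never appears in the result. A minimal base R sketch with a made-up cohort table:

```r
# Toy cohort table (made-up data): cohorts 1 and 2 have entries, cohort 3 has none.
cohort <- data.frame(
  cohort_definition_id = c(1, 1, 2),
  subject_id = c(10, 11, 10)
)
allCohortIds <- c(1, 2, 3)

# Grouped count, analogous to a SQL GROUP BY on cohort_definition_id.
counts <- aggregate(subject_id ~ cohort_definition_id, data = cohort, FUN = length)
names(counts) <- c("cohortId", "cohortEntries")

# Cohort 3 has no rows, so it forms no group and is absent from the result.
missingIds <- setdiff(allCohortIds, counts$cohortId)
print(counts)
print(missingIds)
```

From the counts alone, cohort 3 looks the same as a cohort that was never generated.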

# devtools::install_github("ohdsi-studies/IbdCharacterization")
library(CohortGenerator)
#> Loading required package: DatabaseConnector

# Not really sure why this does not work as it seems like it should.
cohortDefinitionSet <- getCohortDefinitionSet(packageName = "IbdCharacterization",
                                              settingsFileName = "CohortsToCreateIBD.csv")
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger. (This message will not be shown again this R session)
#> Loading cohortDefinitionSet
#> Error: '' does not exist in current working directory ('/private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/RtmpvF24zI/reprex-e4c0403503bd-awful-dog').

# Create the settings file manually

library(dplyr)

cohortDefinitionSet <- readr::read_csv(system.file(file.path("settings", "CohortsToCreateIBD.csv"), package = "IbdCharacterization"), show_col_types = FALSE) %>% 
  mutate(jsonFilename = system.file(file.path("cohorts", paste0(cohortId, ".json")), package = "IbdCharacterization", mustWork = TRUE),
         sqlFilename = system.file(file.path("sql", "sql_server", paste0(cohortId, ".sql")), package = "IbdCharacterization", mustWork = TRUE),
         json = purrr::map_chr(jsonFilename, readr::read_file),
         sql = purrr::map_chr(sqlFilename, readr::read_file)) %>% 
  select(name, cohortName = atlasName, atlasId, cohortId, json, sql)

# Get the Eunomia connection details
connectionDetails <- Eunomia::getEunomiaConnectionDetails()

# First get the cohort table names to use for this generation task
cohortTableNames <- getCohortTableNames(cohortTable = "cg_example")

# Next create the tables on the database
createCohortTables(connectionDetails = connectionDetails,
                   cohortTableNames = cohortTableNames,
                   cohortDatabaseSchema = "main")
#> Connecting using SQLite driver
#> Creating cohort tables
#> - Created table main.cg_example
#> - Created table main.cg_example_inclusion
#> - Created table main.cg_example_inclusion_result
#> - Created table main.cg_example_inclusion_stats
#> - Created table main.cg_example_summary_stats
#> - Created table main.cg_example_censor_stats
#> Creating cohort tables took 0.35secs

# Generate the cohort set
cohortsGenerated <- generateCohortSet(connectionDetails = connectionDetails,
                                      cdmDatabaseSchema = "main",
                                      cohortDatabaseSchema = "main",
                                      cohortTableNames = cohortTableNames,
                                      cohortDefinitionSet = cohortDefinitionSet)
#> Connecting using SQLite driver
#> 1/6- Generating cohort: [IBD ID243 V1] IBD, incidence cohort [OHDSI]
#> Executing SQL took 0.564 secs
#> 2/6- Generating cohort: [IBD ID244 V1] IBD, prevalence cohort [OHDSI]
#> Executing SQL took 0.416 secs
#> 3/6- Generating cohort: [IBD ID362 V1] IBD-U both disease codes incidence cohort [OHDSI]
#> Executing SQL took 0.65 secs
#> 4/6- Generating cohort: [IBD ID414 V1] IBD-U both disease codes prevalence cohort [OHDSI]
#> Executing SQL took 0.501 secs
#> 5/6- Generating cohort: [IBD ID363 V1] IBD-U no disease code, incidence cohort [OHDSI]
#> Executing SQL took 0.645 secs
#> 6/6- Generating cohort: [IBD ID413 V1] IBD-U no disease code, prevalence cohort [OHDSI]
#> Executing SQL took 0.523 secs
#> Generating cohort set took 4.41 secs

actual result

getCohortCounts(connectionDetails,
                cohortDatabaseSchema = "main",
                cohortDefinitionSet = cohortDefinitionSet)
#> Connecting using SQLite driver
#> Counting cohorts took 0.0558 secs
#> [1] cohortId       cohortEntries  cohortSubjects name           cohortName    
#> [6] atlasId        json           sql           
#> <0 rows> (or 0-length row.names)

expected result

cohortDefinitionSet %>% 
  mutate(cohortEntries = 0,
         cohortSubjects = 0)
#> # A tibble: 6 × 8
#>   name                       cohor…¹ atlasId cohor…² json  sql   cohor…³ cohor…⁴
#>   <chr>                      <chr>     <dbl>   <dbl> <chr> <chr>   <dbl>   <dbl>
#> 1 [KI-IBD] Persons with IBD… [IBD I… 1776609     306 "{\n… "CRE…       0       0
#> 2 [KI-IBD] Persons with IBD… [IBD I… 1776610     307 "{\n… "CRE…       0       0
#> 3 [KI-IBD] Undetermined IBD… [IBD I… 1777395     308 "{\n… "CRE…       0       0
#> 4 [KI-IBD] Undetermined IBD… [IBD I… 1777396     309 "{\n… "CRE…       0       0
#> 5 [KI-IBD] Undetermined IBD… [IBD I… 1777393     310 "{\n… "CRE…       0       0
#> 6 [KI-IBD] Undetermined IBD… [IBD I… 1777394     311 "{\n… "CRE…       0       0
#> # … with abbreviated variable names ¹cohortName, ²cohortId, ³cohortEntries,
#> #   ⁴cohortSubjects

anthonysena commented 1 year ago

I think this would be a useful change if someone is interested in contributing to the development. At the moment, the absence of data in the cohort table, which yields no rows in the cohort counts, should be interpreted as a count of 0 people. Knowing whether each cohort has been generated is a separate task; that information is part of the return value of generateCohortSet, as described here: https://ohdsi.github.io/CohortGenerator/reference/generateCohortSet.html
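
Until such a change lands, the caller can fill in the zero rows. A minimal sketch, assuming the counts carry the cohortId, cohortEntries, and cohortSubjects columns shown in the actual result; `fillZeroCounts` is a hypothetical helper, not part of CohortGenerator, and the toy data frames are made up:

```r
# Hypothetical helper (not part of CohortGenerator): left-join the counts onto
# the cohort definitions and report 0 for cohorts that have no count row.
fillZeroCounts <- function(cohortDefinitionSet, counts) {
  result <- merge(
    cohortDefinitionSet,
    counts[, c("cohortId", "cohortEntries", "cohortSubjects")],
    by = "cohortId",
    all.x = TRUE  # keep every defined cohort, even those without counts
  )
  result$cohortEntries[is.na(result$cohortEntries)] <- 0
  result$cohortSubjects[is.na(result$cohortSubjects)] <- 0
  result
}

# Toy inputs: three defined cohorts, counts returned for only one of them.
definitions <- data.frame(cohortId = c(306, 307, 308),
                          cohortName = c("A", "B", "C"))
counts <- data.frame(cohortId = 306, cohortEntries = 5, cohortSubjects = 4)

filled <- fillZeroCounts(definitions, counts)
print(filled)
```

This only treats a missing row as a count of 0. Whether a cohort was generated at all still has to be checked against the return value of generateCohortSet (cohortsGenerated in the reprex).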

ablack3 commented 1 year ago

I’ll share it with the OSC group and see if anyone is interested. If not, I’ll give it a shot and create a PR.

javier-gracia-tabuenca-tuni commented 1 year ago

Hey @ablack3, I'm going to give it a try. Let me know if you have already fixed it.

javier-gracia-tabuenca-tuni commented 1 year ago

Let me know if this is what you wanted: #91

ablack3 commented 1 year ago

Nice work @javier-gracia-tabuenca-tuni!