Separating individual functions in cohort diagnostics

mdlavallee92 commented 1 year ago

A use case I have for cohort diagnostics is running one of the low-level methods of cohort diagnostics such as getIncidenceRate or getVisitContext for a single cohort instead of executing the entire executeDiagnostics or even for a set of cohorts. Is it within the development scope of Cohort Diagnostics to expose these individual functions?

for example:

#CDM variables
connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = "postgresql", 
  server = keyring::key_get("cdm_server"), 
  user = keyring::key_get("cdm_user"), 
  password = keyring::key_get("cdm_password"), 
  port = keyring::key_get("cdm_port"), 
)
cdmDatabaseSchema <- keyring::key_get("cdmSchema")
vocabularyDatabaseSchema <- keyring::key_get("cdmSchema")
cohortDatabaseSchema <- keyring::key_get("writeSchema")
cohortTable <- "cohort"

connectionTest <- DatabaseConnector::connect(connectionDetails)

#create cohort tables 
cohortTableNames <- CohortGenerator::getCohortTableNames(cohortTable = cohortTable)
# Next create the tables on the database
CohortGenerator::createCohortTables(
  connection = connectionTest,
  cohortTableNames = cohortTableNames,
  cohortDatabaseSchema = cohortDatabaseSchema,
  incremental = FALSE
)
# Import cohort csv (id | json | sql)
cohortDefinitionSet <- readr::read_csv("cohortsToCreate.csv")

# Generate the cohort set
CohortGenerator::generateCohortSet(
  connection = connectionTest,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cohortTableNames = cohortTableNames,
  cohortDefinitionSet = cohortDefinitionSet,
  incremental = FALSE
)

# get incidence rate
ir <- getIncidenceRate(
  connection = connectionTest,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cohortTable = cohortTableNames$cohort,
  cdmDatabaseSchema = cdmDatabaseSchema,
  vocabularyDatabaseSchema = vocabularyDatabaseSchema,
  tempEmulationSchema = NULL,
  cohortId = cohortDefinitionSet$cohortId
)

plotIr(ir) # my example plotting function for report

mdlavallee92 commented 1 year ago

@azimov we talked about this idea back at the symposium and I have revisited it now. In a fork I have individualized some of these functions. I have tried to add unit tests to the proposed exposed functions. However I am encountering an issue in the testing environment with SqlRender::loadRenderTranslateSql.

Take this code-block. In the testing environment it returns an error even though I know this file exists.

_Cannot find 'GetCalendarYearRange.sql' in the 'sql' or 'sql/sqlserver' folder of the 'CohortDiagnostics' package.

To avoid this issue in run tests, I unpacked the SqlRender::loadRenderTranslateSql function example below:

  sql <- fs::path_package("CohortDiagnostics", "sql/sql_server/GetCalendarYearRange.sql") %>%
    readr::read_file() %>%
    SqlRender::render(cdm_database_schema = cdmDatabaseSchema) %>%
    SqlRender::translate(targetDialect = connection@dbms)

but that would require editing a lot of code-blocks in CohortDiagnostics. Have you encountered this issue in tests?

azimov commented 1 year ago

@mdlavallee92 Is this using devtools::test? devtools::load_all will break the SqlRender loader for some reason related to how it finds the files.

An easy workaround on linux/mac is to create the symbolic link ln -s inst/sql sql or a shortcut folder from inst/sql to the root folder of the project tree.

Doing this should also work with devtools::load_all()

If you're using RCmd check in R studio (a setting you must adjust, it will try devtools by default I think) the tests should then normally find the SQL files.

I was unaware of fs::path_package adding this into SqlRender directly may fix this issue in place of system.file.

azimov commented 1 year ago

@mdlavallee92 - To the main point of this question. I do intend to expose the functions for use outside of the package, however I can't give a clear timeline. I would also vastly prefer to write unit tests for all the functions directly because the current test suite takes a long time to run and errors are often not especially meaningful.

So the steps I would like to see are:

Decide on which functions should be exposed
Add checkmate parameter checking for all functions to expose
Add all required docstrings
Add good unit tests for these functions (ideally ones that actually run quickly)

Until then It should be possible for you to call these functions with triple colon calls,::: e.g. CohortDiagnostics:::getIncidenceRate(...) until this happens. But doing this it will not be possible to guarantee any kind of API stability.

mdlavallee92 commented 1 year ago

@mdlavallee92 Is this using devtools::test? devtools::load_all will break the SqlRender loader for some reason related to how it finds the files.

An easy workaround on linux/mac is to create the symbolic link ln -s inst/sql sql or a shortcut folder from inst/sql to the root folder of the project tree.

Doing this should also work with devtools::load_all()

If you're using RCmd check in R studio (a setting you must adjust, it will try devtools by default I think) the tests should then normally find the SQL files.

I was unaware of fs::path_package adding this into SqlRender directly may fix this issue in place of system.file.

Yeah that is what I am talking about the devtools::test. I tried to swap out system.file with fs::path_package in a SqlRender fork and it still leads to the same issue, not being able to find the file. Probably an issue with devtools and not SqlRender. It seems either function (system.file or fs::path_package ) will work if placed directly in the test but not if sourcing from a function from the SqlRender package, weird. I am working on a windows computer but I will try the shortcut folder see if that works within the tests. Thanks! Back to the primary issue I can help you with this checklist.

mdlavallee92 commented 1 year ago

Context When I have used CohortDiagnostics (CD), I think of it as a pipeline rather than an R package because it a) takes in settings, b) runs the options based on the input settings, c) is "chatty" meaning it tells the user what happened and d) dumps output to a folder to lend itself to a post-processing/visualization step (i.e. your cool RMM pkg). The reason I bring this up is that I think it would be helpful to provide users access to the "worker" methods so they can layer their diagnostics pipeline. For example, I want to run incidence, concept set diagnostics and cohort relationship on a few cohorts. Maybe I want to add my custom diagnostic step to this. I can access the suggested "workers" and build my own pipeline script. By saying this, I don't intend to suggest scrapping the global execute executeDiagnostics. This function is very useful and I use it a lot already since it does everything for me. It is also a target output for your very cool RMM package making it easy to build a dashboad off a few lines of code. Going back to the pipeline mindset, I think this would help unit testing, debugging and also usership. Thinking of the pipeline also gives me ideas of what to target in terms of key functions to expose.

Pipeline

From the executeDiagnostics wrapper, these are the steps I see:

Setup - accumulate settings, checkmate, start logger, prep cohortDefinitionSet
Cdm Info
ObservationPeriod Range
Create concept table - note this is needed to run the concept set diagnostics and visit context
Get Cohort Counts - single function comes from CohortGenerator
Get Inclusion Stats - single function comes from CohortGenerator::getCohortStats
Run Concept Set Diagnostics
- runIncludedSourceConcept
- runOrphanConcepts
- runBreakdownEvents
  1. Get time series
  2. Get Visit Context - note relies on concept set diagnostics run and concept table temp
  3. Get Incidence Rate
  4. Get Cohort Relationship (overlap in the shiny)
  5. Cohort Characterization
    - run baseline
    - run temporal (your chronograph looking tables, -999 to 999 time windows)
  6. Clean-up - drop temp concept table, write tables as csv and zip

In steps 5-12 there are other wrappers to functionalize the "worker" method on multiple cohort definitions, prep them for saving, provide logging information and export. There are a few exceptions such as cohort relationship which requires the enitre set to work.

Suggestion: Expose the singular diagnostic methods in the pipeline

Some of these methods are already exposed (runCohortTimeSeriesDiagnostics), others have a dedictated function but are not exposed or documented (i.e.getIncidenceRate) and others are nested within other functions (i.e. concept set diagnostic functions). This would be my list of functions to expose:

get ObservationPeriod range (nested in executeDiagnostics)
prep Concept table (nested in executeDiagnostics and runConceptSetDiagnostics) - note this is a dependency for the concept set diagnostics and visit context steps, should be handled as a special case
getIncidenceRate (function exists, not exposed)
getVisitContext (function exists, not exposed)
runCohortTimeSeriesDiagnostics (already done)
runCohortRelationshipDiagnostics (already done)
runIncludedSourceConcepts (nested in runConceptSetDiagnostics)
runBreakdownIndexEvents (nested in runConceptSetDiagnostics)
runOrphanConcepts (nested in runConceptSetDiagnostics)

Other steps in the pipeline are already functions from other packages....cohorts counts and inclusion stats from Cohort Generator and Cohort Characterization from FeatureExtraction

What do you think @azimov?

ablack3 commented 1 year ago

I was unaware of fs::path_package adding this into SqlRender directly may fix this issue in place of system.file.

You could also try pkgload:::shim_system.file() in loadRenderTranslateSql which should work with devtools::load_all.

mdlavallee92 commented 1 year ago

Here is the start of the idea I have, needs some further polishing (checkmates etc) but lot of the code stays the same. Just a matter of disentangling the dependencies, particularly with the concept set diagnostics.

In the process of building the unit tests to make sure these work in isolation. Started testing in Eunomia on cohort 17492 from the cohorts to create tests you already have in there.

Take a look and see what you think! No worries if this is in the wrong direction.

azimov commented 1 year ago

@mdlavallee92 I completley agree with the principle of separating this out into distinct parts of a pipeline. Part of the reasoning behind RMM was to allow people to begin extending Ohdsi result sets in a consistent manner.

I can imagine someone extending the pipeline of CohortDiagnostics pretty trivially by adding their own diagnostics functions that they maybe aren't ready (or willing) to share with the wider world but want to include in their results.

Just adding a function and exporting the results according to their model specification and uploaded. With the changes to OhdsiShinyModules and Jenna's ShinyAppBuilder package you can customize which shiny views are added and add your own custom modules that should slot in. Indeed, this is part of the thinking behind the Strageus package, though I expect that the implementation there will not currently meet these needs.

OHDSI / CohortDiagnostics

Separating individual functions in cohort diagnostics #1043