OHDSI / CohortDiagnostics

An R package for performing various cohort diagnostics.
https://ohdsi.github.io/CohortDiagnostics
41 stars 49 forks source link

Rethinking runCohortDiagnostics function interface #135

Closed ablack3 closed 2 years ago

ablack3 commented 4 years ago

I am very happy about being able to use the CohortDiagnostics package and really appreciate what has been created. It is a wonderful tool. I'm just throwing out an idea to consider here based on some ideas I learned at a workshop on function design. (Relevant slides are here for anyone interested)

Perhaps the runCohortDiagnostics() interface could be improved. As a user of this function my primary questions are

  1. What data do I need to supply to the function?
  2. What external dependencies need to be met to run the function?
  3. What will this function return or what action will this function perform? (i.e. How will the world be different after running the function?)

The function interface is

runCohortDiagnostics(
  packageName = NULL,
  cohortToCreateFile = "settings/CohortsToCreate.csv",
  baseUrl = NULL,
  cohortSetReference = NULL,
  connectionDetails = NULL,
  connection = NULL,
  cdmDatabaseSchema,
  oracleTempSchema = NULL,
  cohortDatabaseSchema,
  cohortTable = "cohort",
  cohortIds = NULL,
  inclusionStatisticsFolder = NULL,
  exportFolder,
  databaseId,
  databaseName = databaseId,
  databaseDescription = "",
  runInclusionStatistics = TRUE,
  runIncludedSourceConcepts = TRUE,
  runOrphanConcepts = TRUE,
  runTimeDistributions = TRUE,
  runBreakdownIndexEvents = TRUE,
  runIncidenceRate = TRUE,
  runCohortOverlap = TRUE,
  runCohortCharacterization = TRUE,
  covariateSettings = FeatureExtraction::createDefaultCovariateSettings(),
  minCellCount = 5,
  incremental = FALSE,
  incrementalFolder = exportFolder
)

The arguments can be divided into data arguments and detail arguments. Data arguments provide the function with the data it needs to run and detail arguments modify the function's behavior.

What do I need to supply this function?

What external dependencies does this function have?

This function gets its input from external files, possibly and API, a database, and from data explicitly passed in by the caller.

What does this function return or what action does it perform?

This function does not return a value. It performs an action. The result of the action is a new folder containing several csv files and a .zip file which can be read in by other functions in the package. Are any other external settings changed or database tables altered? I don't think so but I'm not 100% sure.


I think this function might be wrapping up too much functionality indicated by several pairs of "either or" arguments. Perhaps this should really be two functions - one for running diagnostics using an installed package and another for using WebAPI. Alternatively the process of running diagnostics could be separated into a few steps:

I'm not sure if there is a possible improvement here but maybe.

gowthamrao commented 3 years ago

@ablack3 can you please try version 2.1 to be released on May 24th - and recommend what we should consider adding to version 2.2 from your issue comment. Maybe we should break it - into chunks?

ablack3 commented 3 years ago

I wish there was a lightweight way to execute cohort diagnostics. The ROhdsiWebApi mode is great if ROhdsiWebApi works. Sometimes ssl certificates are not installed and so ROhdsiWebApi can't be used without IT help. Package mode seems good if you're developing a network study but that involves cloning a repo and restoring an renv.lock file which requires an internet connection.

I'm wondering what the bare minimum requirements are to execute cohort diagnostics. It seems like all you really need is the cohort SQL files and a database connection. It would be nice if we had a way to execute diagnostics that is as minimal as it can possibly be.

gowthamrao commented 3 years ago

@ablack3 how about two new functions

runCohortDiagnosticsInWebApiMode - which will just call runCohortDiagnostics function's webapi mode part - and only accept the parts of runCohortDiagnostics needed to run in WebApi mode. runCohortDiagnosticsInPackageMode - which will just call runCohortDiagnostics function's package mode part - and only accept the parts of runCohortDiagnostics needed to run in package mode.

Using that framework - we can create many other (potentially bare minimum) ways to execute cohort diagnostics - where the new set of runCohortDiagnosticsInXXXXMode functions encapsulate the underlying complexity. The new set of functions can also have in it a set of checks that are unique to the mode e.g. check if it has the set of SQLs needed for the bare minimum execution in the bare minimum mode etc. We can retain the old runCohortDiagnostics (parent) function - ensuring backward compatibility and serving as a router of sorts. We can put a .deprecated message if needed to runCohortDiagnostics parent function.

@azimov what are your thoughts? I am completely sold on this idea, but @ablack3 has a made a compelling argument.


@ablack3 continuing on the bare minimum mode - i would really like to have to future state where we can use Cohort Diagnostics in an 'interactive' mode i.e. build a cohort definition, get its sql, run only incidence rate plot, check if the incidence rate plot passes diagnostics, if not - make changes to cohort definition, re run only incidence rate plot, check if the incidence rate plot passes diagnostics -- iterate till happy. Now go to next diagnostics, e.g. temporal diagnostics, iterate (in each instance we are running one diagnostics for one cohort till 'happy')

ablack3 commented 3 years ago

Great ideas @gowthamrao. I think it does make sense to have several different ways to get diagnostics for different use cases. I really like the idea of functions designed to be used interactively while developing a cohort that run with little overhead. Let me know if there is a piece of this you’d like me to work on implementing.

gowthamrao commented 3 years ago

@ablack3 ofcourse there is a LOT you can help contribute! I just created a new branch

https://github.com/OHDSI/CohortDiagnostics/tree/reDoFunctions

do you want to create two new functions following the framework runCohortDiagnosticsInXXXXMode

runCohortDiagnosticsInWebApiMode - which will just call runCohortDiagnostics function's webapi mode part - and only accept the parts of runCohortDiagnostics needed to run in WebApi mode. runCohortDiagnosticsInPackageMode - which will just call runCohortDiagnostics function's package mode part - and only accept the parts of runCohortDiagnostics needed to run in package mode.

You can updated the test code initially here https://github.com/OHDSI/CohortDiagnostics/tree/reDoFunctions/extras/tests and then in test-that here https://github.com/OHDSI/CohortDiagnostics/tree/reDoFunctions/tests/testthat

gowthamrao commented 3 years ago

After that - we can visit how we can start designing the bare minimum mode/interactive mode.

Do you think thats reasonable?

gowthamrao commented 3 years ago

@ablack3 in another branch, i created a set of functions that i hope are following this design

please see PR here https://github.com/OHDSI/CohortDiagnostics/pull/440

ablack3 commented 3 years ago

do you want to create two new functions following the framework runCohortDiagnosticsInXXXXMode

Sure I will give it a try!

ablack3 commented 3 years ago

In looking at the code a bit more I see more function that have "either or" arguments. They work very differently depending on which arguments are passed in.

instantiateCohortSet works in two different modes and requires different arguments depending on which mode is being used. launchDiagnosticsExplorer might use downloaded data or data that is in a database.

One way to achieve different behavior in R is to use generic functions that will accept different types of objects. The behavior of the function might change depending on what type of object is passed in.

I'm also thinking about all the flag arguments and other ways to get the same behavior. Here is an article that presents an alternative to using flag arguments to control function behavior. https://ardalis.com/are-boolean-flags-on-methods-a-code-smell/

Just some thoughts. I'll work on the new functions over the weekend.

gowthamrao commented 3 years ago

After discussing with @ablack3 moving to 2.3

gowthamrao commented 2 years ago

@ablack3 we are revisiting this issue now! @azimov is leading us forward :)

ablack3 commented 2 years ago

Awesome! Thanks @azimov! I think one pattern that wasn't supported last time I checked is running cohort diagnostics when I have cohort creation SQL scripts in a project that is not an R package.

azimov commented 2 years ago

One core change that I'm going to make with @anthonysena and @chrisknoll is to change the main interface to call cohort diagnostics so that it expects the cohort tables to already exist (after being created with a package such as CohortGenerator). The plan is for this new function, executeCohortDiagnostics to only take a resulting, pre-generated cohort table as many users come from situations where they don't need to generate the cohort data. However, I expect the vignettes will include descriptions on how to do this.

Initially, the refactoring will be to support the current runCohortDiagnostics API without creating any breaking changes, but making a new API that will be the preferred version for new users of the software. The docs will then be easier to interpret. We will throw a deprecation warning for using the current interface and retire it in a breaking release version.

azimov commented 2 years ago

@ablack3 your input on the latest pr #685 would be greatly appreciated. When modifying the code I found that inputting the different sources for where cohort references could come from was unnecessary. My feeling was that this could be a parameter after being loaded (in whatever number of ways you can think of). Primarily, we want users to create study packages so this is the approach that is highlighted in the vignettes.

Beyond this, after instantiating cohorts the vast majority of users will only ever need to execute the package like this:

executeDiagnostics(cohorts, # data.frame reference
                                   connectionDetails = connectionDetails,
                                   cohortTable = cohortTable,
                                   cohortDatabaseSchema = cohortDatabaseSchema,
                                   cdmDatabaseSchema = cdmDatabaseSchema,
                                   exportFolder = exportFolder,
                                   inclusionStatisticsFolder = inclusionStatisticsFolder,
                                   databaseId = "MyCdm")

The other parameters are only really useful if you're doing more complex charachterizations (which will require familiarity with FeatureExtraction) or turning off different diagnostics, which is not recommended if you're trying to properly build phenotypes!

In my opinion, inclusionStatisticsFolder is problematic as its probably confusing to users why this isn't just part of the diagnostics process. I think we should raise a separate issue to resolve this.

chrisknoll commented 2 years ago

Hi, @azimov. We're going to address the inclusionStatisticsFolder when we introduce cohort generation package to this (which will manage all the statistics tables). If/When CD leverages cohort generation, CD will be able to pull the statistics from the results schema (just like any other statistics that are being generated by cohort diagnosis).

Pinging @anthonysena

ablack3 commented 2 years ago

Primarily, we want users to create study packages so this is the approach that is highlighted in the vignettes.

Running cohort diagnostics as part of a larger study package makes sense. What about this pattern of creating a standalone package just for running cohort diagnostics?

examples: https://github.com/ohdsi-studies/Ceeamos/tree/master/CeeamosCohortDiagnostics https://github.com/ohdsi-studies/Glp1ClrdEstimation/tree/master/CohortDiagnostics/HydratedPackage

This seems like a lot of overhead (i.e. filling out all the boilerplate) to run cohort diagnostics. In these two examples I think the DESCRIPTION file is the hydra boilerplate (link, link).

executeDiagnostics(cohorts, # data.frame reference connectionDetails = connectionDetails, cohortTable = cohortTable, cohortDatabaseSchema = cohortDatabaseSchema, cdmDatabaseSchema = cdmDatabaseSchema, exportFolder = exportFolder, inclusionStatisticsFolder = inclusionStatisticsFolder, databaseId = "MyCdm")

This interface seems great to me and using the cohort generation package. Removing the inclusionStatisticsFolder argument seems like a good idea too but I agree with Martijn's comment that we need to be transparent about creating tables in the database and also remember that users may not have write access to their CDM database.

azimov commented 2 years ago

The newexecuteDiagnostics will be included in the next release. runCohortDiagnostics still exists as an interface but there is some debate in my mind about removing it (and the old instantiateCohort code) for the next release as it will break all study packages currently in existence.