Closed ablack3 closed 2 years ago
@ablack3 can you please try version 2.1 to be released on May 24th - and recommend what we should consider adding to version 2.2 from your issue comment. Maybe we should break it - into chunks?
I wish there was a lightweight way to execute cohort diagnostics. The ROhdsiWebApi mode is great if ROhdsiWebApi works. Sometimes ssl certificates are not installed and so ROhdsiWebApi can't be used without IT help. Package mode seems good if you're developing a network study but that involves cloning a repo and restoring an renv.lock file which requires an internet connection.
I'm wondering what the bare minimum requirements are to execute cohort diagnostics. It seems like all you really need is the cohort SQL files and a database connection. It would be nice if we had a way to execute diagnostics that is as minimal as it can possibly be.
@ablack3 how about two new functions
runCohortDiagnosticsInWebApiMode - which will just call runCohortDiagnostics function's webapi mode part - and only accept the parts of runCohortDiagnostics needed to run in WebApi mode. runCohortDiagnosticsInPackageMode - which will just call runCohortDiagnostics function's package mode part - and only accept the parts of runCohortDiagnostics needed to run in package mode.
Using that framework - we can create many other (potentially bare minimum) ways to execute cohort diagnostics - where the new set of runCohortDiagnosticsInXXXXMode functions encapsulate the underlying complexity. The new set of functions can also have in it a set of checks that are unique to the mode e.g. check if it has the set of SQLs needed for the bare minimum execution in the bare minimum mode etc. We can retain the old runCohortDiagnostics (parent) function - ensuring backward compatibility and serving as a router of sorts. We can put a .deprecated message if needed to runCohortDiagnostics parent function.
@azimov what are your thoughts? I am completely sold on this idea, but @ablack3 has a made a compelling argument.
@ablack3 continuing on the bare minimum mode - i would really like to have to future state where we can use Cohort Diagnostics in an 'interactive' mode i.e. build a cohort definition, get its sql, run only incidence rate plot, check if the incidence rate plot passes diagnostics, if not - make changes to cohort definition, re run only incidence rate plot, check if the incidence rate plot passes diagnostics -- iterate till happy. Now go to next diagnostics, e.g. temporal diagnostics, iterate (in each instance we are running one diagnostics for one cohort till 'happy')
Great ideas @gowthamrao. I think it does make sense to have several different ways to get diagnostics for different use cases. I really like the idea of functions designed to be used interactively while developing a cohort that run with little overhead. Let me know if there is a piece of this you’d like me to work on implementing.
@ablack3 ofcourse there is a LOT you can help contribute! I just created a new branch
https://github.com/OHDSI/CohortDiagnostics/tree/reDoFunctions
do you want to create two new functions following the framework runCohortDiagnosticsInXXXXMode
runCohortDiagnosticsInWebApiMode - which will just call runCohortDiagnostics function's webapi mode part - and only accept the parts of runCohortDiagnostics needed to run in WebApi mode. runCohortDiagnosticsInPackageMode - which will just call runCohortDiagnostics function's package mode part - and only accept the parts of runCohortDiagnostics needed to run in package mode.
You can updated the test code initially here https://github.com/OHDSI/CohortDiagnostics/tree/reDoFunctions/extras/tests and then in test-that here https://github.com/OHDSI/CohortDiagnostics/tree/reDoFunctions/tests/testthat
After that - we can visit how we can start designing the bare minimum mode/interactive mode.
Do you think thats reasonable?
@ablack3 in another branch, i created a set of functions that i hope are following this design
please see PR here https://github.com/OHDSI/CohortDiagnostics/pull/440
do you want to create two new functions following the framework runCohortDiagnosticsInXXXXMode
Sure I will give it a try!
In looking at the code a bit more I see more function that have "either or" arguments. They work very differently depending on which arguments are passed in.
instantiateCohortSet
works in two different modes and requires different arguments depending on which mode is being used.
launchDiagnosticsExplorer
might use downloaded data or data that is in a database.
One way to achieve different behavior in R is to use generic functions that will accept different types of objects. The behavior of the function might change depending on what type of object is passed in.
I'm also thinking about all the flag arguments and other ways to get the same behavior. Here is an article that presents an alternative to using flag arguments to control function behavior. https://ardalis.com/are-boolean-flags-on-methods-a-code-smell/
Just some thoughts. I'll work on the new functions over the weekend.
After discussing with @ablack3 moving to 2.3
@ablack3 we are revisiting this issue now! @azimov is leading us forward :)
Awesome! Thanks @azimov! I think one pattern that wasn't supported last time I checked is running cohort diagnostics when I have cohort creation SQL scripts in a project that is not an R package.
One core change that I'm going to make with @anthonysena and @chrisknoll is to change the main interface to call cohort diagnostics so that it expects the cohort tables to already exist (after being created with a package such as CohortGenerator). The plan is for this new function, executeCohortDiagnostics
to only take a resulting, pre-generated cohort table as many users come from situations where they don't need to generate the cohort data. However, I expect the vignettes will include descriptions on how to do this.
Initially, the refactoring will be to support the current runCohortDiagnostics
API without creating any breaking changes, but making a new API that will be the preferred version for new users of the software. The docs will then be easier to interpret. We will throw a deprecation warning for using the current interface and retire it in a breaking release version.
@ablack3 your input on the latest pr #685 would be greatly appreciated. When modifying the code I found that inputting the different sources for where cohort references could come from was unnecessary. My feeling was that this could be a parameter after being loaded (in whatever number of ways you can think of). Primarily, we want users to create study packages so this is the approach that is highlighted in the vignettes.
Beyond this, after instantiating cohorts the vast majority of users will only ever need to execute the package like this:
executeDiagnostics(cohorts, # data.frame reference
connectionDetails = connectionDetails,
cohortTable = cohortTable,
cohortDatabaseSchema = cohortDatabaseSchema,
cdmDatabaseSchema = cdmDatabaseSchema,
exportFolder = exportFolder,
inclusionStatisticsFolder = inclusionStatisticsFolder,
databaseId = "MyCdm")
The other parameters are only really useful if you're doing more complex charachterizations (which will require familiarity with FeatureExtraction) or turning off different diagnostics, which is not recommended if you're trying to properly build phenotypes!
In my opinion, inclusionStatisticsFolder
is problematic as its probably confusing to users why this isn't just part of the diagnostics process. I think we should raise a separate issue to resolve this.
Hi, @azimov. We're going to address the inclusionStatisticsFolder when we introduce cohort generation package to this (which will manage all the statistics tables). If/When CD leverages cohort generation, CD will be able to pull the statistics from the results schema (just like any other statistics that are being generated by cohort diagnosis).
Pinging @anthonysena
Primarily, we want users to create study packages so this is the approach that is highlighted in the vignettes.
Running cohort diagnostics as part of a larger study package makes sense. What about this pattern of creating a standalone package just for running cohort diagnostics?
examples: https://github.com/ohdsi-studies/Ceeamos/tree/master/CeeamosCohortDiagnostics https://github.com/ohdsi-studies/Glp1ClrdEstimation/tree/master/CohortDiagnostics/HydratedPackage
This seems like a lot of overhead (i.e. filling out all the boilerplate) to run cohort diagnostics. In these two examples I think the DESCRIPTION file is the hydra boilerplate (link, link).
executeDiagnostics(cohorts, # data.frame reference connectionDetails = connectionDetails, cohortTable = cohortTable, cohortDatabaseSchema = cohortDatabaseSchema, cdmDatabaseSchema = cdmDatabaseSchema, exportFolder = exportFolder, inclusionStatisticsFolder = inclusionStatisticsFolder, databaseId = "MyCdm")
This interface seems great to me and using the cohort generation package. Removing the inclusionStatisticsFolder argument seems like a good idea too but I agree with Martijn's comment that we need to be transparent about creating tables in the database and also remember that users may not have write access to their CDM database.
The newexecuteDiagnostics
will be included in the next release. runCohortDiagnostics
still exists as an interface but there is some debate in my mind about removing it (and the old instantiateCohort
code) for the next release as it will break all study packages currently in existence.
I am very happy about being able to use the CohortDiagnostics package and really appreciate what has been created. It is a wonderful tool. I'm just throwing out an idea to consider here based on some ideas I learned at a workshop on function design. (Relevant slides are here for anyone interested)
Perhaps the runCohortDiagnostics() interface could be improved. As a user of this function my primary questions are
The function interface is
The arguments can be divided into data arguments and detail arguments. Data arguments provide the function with the data it needs to run and detail arguments modify the function's behavior.
What do I need to supply this function?
What external dependencies does this function have?
This function gets its input from external files, possibly and API, a database, and from data explicitly passed in by the caller.
What does this function return or what action does it perform?
This function does not return a value. It performs an action. The result of the action is a new folder containing several csv files and a .zip file which can be read in by other functions in the package. Are any other external settings changed or database tables altered? I don't think so but I'm not 100% sure.
I think this function might be wrapping up too much functionality indicated by several pairs of "either or" arguments. Perhaps this should really be two functions - one for running diagnostics using an installed package and another for using WebAPI. Alternatively the process of running diagnostics could be separated into a few steps:
I'm not sure if there is a possible improvement here but maybe.