OHDSI / Strategus

[Under development] An R package for coordinating and executing analytics using HADES modules
https://ohdsi.github.io/Strategus/

Refactor Strategus module approach #145

Closed anthonysena closed 3 months ago

anthonysena commented 4 months ago

This draft pull request aims to address the following issues:

The current design is based on discussion in the Strategus HADES WG; I'll describe the changes in detail below.

Overview

The major aim of Strategus v1.0 is to in-source the modules that currently reside in their own code repositories (e.g. the CohortGenerator module). To do this, we aim to provide a set of common interfaces that all modules will implement to perform their specific analytical use case. In addition, we'd like to remove the keyring, renv and targets dependencies from Strategus itself, as they created too much complexity and too many headaches. We will still use renv for dependency management; however, we will remove the use of renv::run() from Strategus, and may retain renv as a dependency only if we want to ensure that a project has properly initialized renv, etc.

This branch & PR assume that the structure of the analysis specification remains unchanged, mainly for my own sanity while working out these changes. More discussion on the structure of the analysis specification is happening in #144.

R/Module-StrategusModule.R

This file holds the R6 classes that aim to provide a common interface (a base class, in the case of R6) for all StrategusModule implementations.
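To give a sense of the shape of that interface, here is a simplified sketch (the method and argument names below are inferred from this thread; the definitions in the branch may differ slightly):

# Simplified sketch of the common interface; see R/Module-StrategusModule.R for the real definitions.
# (The real base class also provides some shared implementation; stubs are shown here for brevity.)
StrategusModule <- R6::R6Class(
  classname = "StrategusModule",
  public = list(
    execute = function(connectionDetails, analysisSpecifications, executionSettings) {
      stop("execute() must be implemented by the child module")
    },
    createResultsDataModel = function(resultsConnectionDetails, resultsDatabaseSchema, tablePrefix = "") {
      stop("createResultsDataModel() must be implemented by the child module")
    },
    uploadResults = function(resultsConnectionDetails, analysisSpecifications, resultsUploadSettings) {
      stop("uploadResults() must be implemented by the child module")
    },
    createModuleSpecifications = function(...) {
      stop("createModuleSpecifications() must be implemented by the child module")
    },
    validateModuleSpecifications = function(moduleSpecifications) {
      stop("validateModuleSpecifications() must be implemented by the child module")
    }
  )
)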

I'll note that in the function signatures above, connectionDetails and resultsConnectionDetails are purposely separated from the execution settings so that we can serialize the execution settings without the worry that it may include sensitive information. This replaces the behavior from Strategus v0.x in which we had a connection details reference that we used to grab the credentials from keyring.

The StrategusModule base class does have some base implementation that is worth pointing out here:

I've included the following modules:

Test Harness

There is a sample script in extras/R6ClassFun.R that aims to provide a simple test harness that will: create an analysis specification (only for CG and CM for now), execute the analysis via Strategus, create an empty SQLite database, create the results schema tables, and upload results. There is also some code to launch a Shiny results viewer to inspect the results. This has been my primary way of testing these changes, and it is a good reference for the main changes in this branch. To point out a few specific items:

Creating module settings: This is now done by creating an instance of the module class. For example:

cgModuleSettingsCreator <- CohortGeneratorModule$new()
# Create the settings & validate them
cohortSharedResourcesSpecifications <- cgModuleSettingsCreator$createCohortSharedResourceSpecifications(cohortDefinitionSet)
cgModuleSettingsCreator$validateCohortSharedResourceSpecifications(cohortSharedResourcesSpecifications)

ncoCohortSharedResourceSpecifications <- cgModuleSettingsCreator$createNegativeControlOutcomeCohortSharedResourceSpecifications(ncoCohortSet, "first", TRUE)
cgModuleSettingsCreator$validateNegativeControlOutcomeCohortSharedResourceSpecifications(ncoCohortSharedResourceSpecifications)

cgModuleSettings <- cgModuleSettingsCreator$createModuleSpecifications()
cgModuleSettingsCreator$validateModuleSpecifications(cgModuleSettings)

This removes the need to source("/SettingsFunction.R") from the module's repo, as was done in v0.x. Additionally, some basic validation may be performed by the module (as was desired in #9, although what is in this branch is far from a complete solution to that issue).
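From there, the harness assembles the analysis specification from those settings and hands it to Strategus for execution. Roughly along these lines (only a sketch: the helper names follow the v0.x vignettes and may not match extras/R6ClassFun.R exactly):

# Assemble the analysis specification from the shared resources & module settings created above
analysisSpecifications <- Strategus::createEmptyAnalysisSpecifications() |>
  Strategus::addSharedResources(cohortSharedResourcesSpecifications) |>
  Strategus::addSharedResources(ncoCohortSharedResourceSpecifications) |>
  Strategus::addModuleSpecifications(cgModuleSettings)

# Use the Eunomia test CDM; connection details are passed separately and never serialized
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
outputFolder <- file.path(tempdir(), "strategusTest")

executionSettings <- Strategus::createCdmExecutionSettings(
  workDatabaseSchema = "main",
  cdmDatabaseSchema = "main",
  cohortTableNames = CohortGenerator::getCohortTableNames(cohortTable = "strategus_cohorts"),
  workFolder = file.path(outputFolder, "work"),
  resultsFolder = file.path(outputFolder, "results")
)

Strategus::execute(
  analysisSpecifications = analysisSpecifications,
  executionSettings = executionSettings,
  connectionDetails = connectionDetails
)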

Development Notes

I welcome any feedback/suggestions - tagging some of the HADES Strategus WG members including @schuemie @azimov @chrisknoll @jreps.

schuemie commented 4 months ago

Excellent work @anthonysena ! This seems to cover pretty much everything we discussed. I'm almost sorry to see all that complicated code go ;-) Some notes:

  1. I'll dig into why ParallelLogger isn't behaving. It should definitely be possible to have multiple loggers at the same time.
  2. EvidenceSynthesis is somewhat of an outlier, since the base package is on CRAN and only holds the statistical functions. All the data wrangling is done in the module. I don't think we'd want all of that to go into the base package, also because it can't depend on RMM (ResultModelManager), which is not on CRAN. Currently the base package knows nothing about results models. I'm not sure if we should put the data wrangling in some new package, or leave it here in Strategus. I think for the foreseeable future, we can just move the resultsDataModelSpecification.csv into Strategus?
anthonysena commented 4 months ago

Thanks for the review of this PR @schuemie! I'm working out changes for the logging (I had missed creating the results folder).

Let me address your point about EvidenceSynthesis: I think for ES, we can move the resultsDataModelSpecification.csv into Strategus since the logic for constructing the results lies in the module. For the other packages (e.g. CohortIncidence), I'd prefer to see the resultsDataModelSpecification.csv reside in the package since the logic for creating the results lives there.

chrisknoll commented 4 months ago

Posting this here per request of @anthonysena 👍

JobContext vs (connectionDetails, analysisSpecifications, executionSettings): Initially we had an object (a JobContext) that encapsulated the design and the execution settings together to give us a single message format to pass into the module functions. The new form of the API accepts the 3 params separately and depends on a parent-class function, .createJobContext, which is invoked in the super of execute, upload, validate, etc. This adds another dimension of understanding what the classes at the parent level are doing, which normally should help you along with your implementation, but forgetting to call super$ in the specific functions becomes more of a burden than a help. I'll talk about the idea around the R6 class hierarchy later.

I think we need to rethink what constitutes the 2 main elements of the Strategus API: the analysisSpec and the executionSettings. I'd like it if we constructed an R6 class, StrategusAnalysis, which contains the elements for cohort defs and each of our analysis specifications (CI, CD, PLP, etc.). We shouldn't need to construct a job context; rather, if the PLP module wants the PLP input, it can just go to the Strategus analysis object and fetch the elements from it that contain its analysis spec. executionSettings is something separate which encapsulates information about the execution (what is the root folder, is there a table prefix for results, what are the schemas, what is the connection, etc.). I think we should reduce the API calls to needing one or both of the above, and not have connectionDetails as a separate param (we can get into the details of why that was necessary, unless it was a matter of wanting to serialize executionSettings as JSON without connectionDetails in that payload, but I don't think you need to serialize executionSettings... it's just encapsulating data).

Certain functions would only need analysisSpecifications as an input; I'm thinking validate() would be in this case. In CI's case, I would fetch the IR design from the analysisSpecifications and do validation work on it (maybe I want to ensure that the target and outcome cohorts in the design have been defined in the cohortDefinitionSet, for example). But since validation isn't about execution, you wouldn't need executionSettings in this context. Unfortunately, it looks like validateModuleSpecifications takes a third construct, moduleSpecifications, which is none of the above and introduces a new type of information that I believe could be found inside the analysis specifications. Although I also remember something about providing 'module parameters' like 'include inclusion stats' in CohortGenerator, I think we should get away from these special locations for this type of information: if the analysis calls for inclusion rule statistics for CohortGenerator, it should be part of the analysis specs.

Logging: Logging bit me a bit due to the way the inheritance was being implemented. It seemed the super$ of execute and upload (maybe validate?) would do some 'class setup' at the start of the function call to create loggers, and the problem is that if execute calls over to validate and each of those methods does a super$ to initialize a log, they will both try to open an output file to log, resulting in an error. So I think we should take logging out of the classes and just use the standard R API to log (like message, warning, etc.); I believe ParallelLogger handles redirecting those calls to the target file somehow. I'm not exactly sure how ParallelLogger does this, but I can say from the many other logging frameworks I have worked with that you set up your logging configuration at the application level (such as the location of the log file, how it should roll over, by time or size, and which logging levels should be written to the log), and you write to the logs using a standard logging API. I'm not sure we have anything this sophisticated in R, but I think we're attempting to reinvent the logging wheel in Strategus and going against some standard/best practices in logging.
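To make that concrete, the kind of application-level setup I have in mind would look something like this (just a sketch of configuring ParallelLogger once, outside the module classes; the folder variable is only for illustration):

# Configure logging once at the application/Strategus level, not per module class
logFileName <- file.path(resultsFolder, "strategus-log.txt")  # resultsFolder assumed to exist
ParallelLogger::addDefaultFileLogger(logFileName)
ParallelLogger::addDefaultErrorReportLogger(file.path(resultsFolder, "errorReport.txt"))

# Modules then just write to the log through the standard logging API
ParallelLogger::logInfo("Starting CohortGenerator module execution")
ParallelLogger::logWarn("Cohort 123 returned no subjects")

# Clean up the registered loggers when the run is done
ParallelLogger::unregisterLogger("DEFAULT_FILE_LOGGER")
ParallelLogger::unregisterLogger("DEFAULT_ERRORREPORT_LOGGER")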

R6 class inheritance: I think we're depending too heavily on a class structure where child classes have dependencies on behavior of the parent, making it more complicated to implement a module because you have to understand the details of the parent in order to fully implement the child. This can work, but only after very careful planning. In all my years, I have found that an effective class hierarchy emerges over time, or when there was a very specific plan for how the parent's behavior was to be extended by the child. I think the classic effective application of inheritance is the notion of an abstract base class, where you have the general process defined in the parent and the children 'fill in the blanks' by implementing the parent's abstract methods. However, that's not what's happening in the Strategus module design. I recommend we strip most (if not all except the core module functions) of the inheritance and instead just have a base class that provides the 'not implemented' functions for the 3 or 4 methods (execute, validate, upload, createResultsModel), and have the children responsible for 'filling in the blanks' of how to implement those operations. If we see that some of these functions depend on similar functions (such as: get the moduleSettings for this module from the analysisSettings), I'd rather have a utility function that allows reuse of the operation (and easier unit testing) than a parent function that is inherited by the child. To this point: it's not specifically a module function to extract module settings from an analysisSpec. I definitely want to avoid 'special parent state' that needs to be initialized in order for the child to function properly; that just leads to errors.

Strategus design spec R6 class: I feel like this has been put off for a long time, and I'd like to not push it off any further and build the StrategusDesignSpec class that encapsulates the elements of a Strategus analysis. This will simplify the navigation to the analysis elements. My worry is that we'll get it all 'working' under the old model, and it's going to be too painful to make the switch (which will fundamentally impact all the modules because the analysis spec is changing). So let's bite that bullet now; I'm 100% confident that 3 hours of focus time can specify the design elements (we already know what goes into it, it's just arranging it in a simpler form that is easier to navigate). Once we have this, the analysisSpecs params of all of our module functions become an easy-to-navigate and known quantity that each module can consume to do its work. Let's carve out the time, get it done, and then adjust the modules to work with the new structure.

I thought analysisSpecification was a class; I tried to search for it in RStudio but only found an AnalysisSpecifications.R under man-roxygen, which led me to createEmptyAnalysisSpecifications. Currently the analysisSpec is a list of sharedResources and moduleSpecifications, with a class() attribute of AnalysisSpecifications. Let's make this a full-blown R6 class.
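To sketch the kind of class I'm describing (everything here is illustrative, not code from the branch):

# Illustrative only: a design-spec class that modules can query directly
StrategusAnalysis <- R6::R6Class(
  classname = "StrategusAnalysis",
  public = list(
    cohortDefinitionSet = NULL,
    negativeControlOutcomes = NULL,
    moduleSpecifications = list(),
    addModuleSpecifications = function(moduleName, specifications) {
      self$moduleSpecifications[[moduleName]] <- specifications
      invisible(self)
    },
    getModuleSpecifications = function(moduleName) {
      # e.g. the PLP module asks the analysis object for its own piece of the design
      self$moduleSpecifications[[moduleName]]
    }
  )
)

# A module would then simply pull its own settings:
# plpSpecs <- strategusAnalysis$getModuleSpecifications("PatientLevelPrediction")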

anthonysena commented 4 months ago

Thanks @chrisknoll for the review! Let me respond to your feedback, in a slightly different order:

> Logging: Logging bit me a bit due to the way the inheritance was being implemented. It seemed the super$ of execute and upload (maybe validate?) would do some 'class setup' at the start of the function call to create loggers, and the problem is that if execute calls over to validate and each of those methods does a super$ to initialize a log, they will both try to open an output file to log, resulting in an error. So I think we should take logging out of the classes and just use the standard R API to log (like message, warning, etc.); I believe ParallelLogger handles redirecting those calls to the target file somehow. I'm not exactly sure how ParallelLogger does this, but I can say from the many other logging frameworks I have worked with that you set up your logging configuration at the application level (such as the location of the log file, how it should roll over, by time or size, and which logging levels should be written to the log), and you write to the logs using a standard logging API. I'm not sure we have anything this sophisticated in R, but I think we're attempting to reinvent the logging wheel in Strategus and going against some standard/best practices in logging.

Agreed - I've removed the logging piece of the StrategusModule per https://github.com/OHDSI/Strategus/pull/145/commits/87b33b4be6a0432bad6091ad53962b36cfb2ad9e.

> JobContext vs (connectionDetails, analysisSpecifications, executionSettings): Initially we had an object (a JobContext) that encapsulated the design and the execution settings together to give us a single message format to pass into the module functions. The new form of the API accepts the 3 params separately and depends on a parent-class function, .createJobContext, which is invoked in the super of execute, upload, validate, etc. This adds another dimension of understanding what the classes at the parent level are doing, which normally should help you along with your implementation, but forgetting to call super$ in the specific functions becomes more of a burden than a help. I'll talk about the idea around the R6 class hierarchy later.

> I think we need to rethink what constitutes the 2 main elements of the Strategus API: the analysisSpec and the executionSettings. I'd like it if we constructed an R6 class, StrategusAnalysis, which contains the elements for cohort defs and each of our analysis specifications (CI, CD, PLP, etc.). We shouldn't need to construct a job context; rather, if the PLP module wants the PLP input, it can just go to the Strategus analysis object and fetch the elements from it that contain its analysis spec. executionSettings is something separate which encapsulates information about the execution (what is the root folder, is there a table prefix for results, what are the schemas, what is the connection, etc.). I think we should reduce the API calls to needing one or both of the above, and not have connectionDetails as a separate param (we can get into the details of why that was necessary, unless it was a matter of wanting to serialize executionSettings as JSON without connectionDetails in that payload, but I don't think you need to serialize executionSettings... it's just encapsulating data).

> Certain functions would only need analysisSpecifications as an input; I'm thinking validate() would be in this case. In CI's case, I would fetch the IR design from the analysisSpecifications and do validation work on it (maybe I want to ensure that the target and outcome cohorts in the design have been defined in the cohortDefinitionSet, for example). But since validation isn't about execution, you wouldn't need executionSettings in this context. Unfortunately, it looks like validateModuleSpecifications takes a third construct, moduleSpecifications, which is none of the above and introduces a new type of information that I believe could be found inside the analysis specifications. Although I also remember something about providing 'module parameters' like 'include inclusion stats' in CohortGenerator, I think we should get away from these special locations for this type of information: if the analysis calls for inclusion rule statistics for CohortGenerator, it should be part of the analysis specs.

> Strategus design spec R6 class: I feel like this has been put off for a long time, and I'd like to not push it off any further and build the StrategusDesignSpec class that encapsulates the elements of a Strategus analysis. This will simplify the navigation to the analysis elements. My worry is that we'll get it all 'working' under the old model, and it's going to be too painful to make the switch (which will fundamentally impact all the modules because the analysis spec is changing). So let's bite that bullet now; I'm 100% confident that 3 hours of focus time can specify the design elements (we already know what goes into it, it's just arranging it in a simpler form that is easier to navigate). Once we have this, the analysisSpecs params of all of our module functions become an easy-to-navigate and known quantity that each module can consume to do its work. Let's carve out the time, get it done, and then adjust the modules to work with the new structure.

> I thought analysisSpecification was a class; I tried to search for it in RStudio but only found an AnalysisSpecifications.R under man-roxygen, which led me to createEmptyAnalysisSpecifications. Currently the analysisSpec is a list of sharedResources and moduleSpecifications, with a class() attribute of AnalysisSpecifications. Let's make this a full-blown R6 class.

There are a few ideas in this feedback, so let me take them one at a time. First, I retained the notion of a jobContext specifically to keep the scope of this particular feature branch small. This was a practical consideration, since it made the job of moving the code from 7 different repos into Strategus much easier: I can visually inspect the code in this branch against the corresponding source code repository to verify it is the same. When this work is merged into the v1.0-main branch, I'll squash this work down and we'll have a commit that shows that we've moved the code. We can and will continue to refactor from this point forward until we release v1.0. Please view the JobContext class as something internal to the behavior of the modules for now, simply there to help us move forward with the refactoring; it may be completely refactored out later as we address other issues, such as changing the analysis specification.

Changing the analysis specification is an issue for #144 and we can revisit these ideas there, including how modules work with the analysis specification. I think making it an R6 class makes sense. To your point about moduleSpecifications vs. analysisSpecifications: an analysis specification is made up of 1 or more module specifications, so my thinking is that a module is only responsible for confirming that its own settings are valid, not the entire analysis specification. Additionally, we'd like to be able to validate the module specifications when assembling the analysis specification. We probably should have a validate(analysisSpecification) function in Strategus that can work with the analysis specification and use each module to validate its individual parts. We can open a related issue on this topic if we want to discuss validation and its impact on the API.
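Roughly what I have in mind for that (purely hypothetical; none of these helpers exist in the branch):

# Hypothetical package-level validation that delegates to each module
validateAnalysisSpecifications <- function(analysisSpecifications) {
  for (moduleSpecifications in analysisSpecifications$moduleSpecifications) {
    # Assumes the module's R6 class can be looked up from the name stored in its specifications
    module <- get(moduleSpecifications$module)$new()
    module$validateModuleSpecifications(moduleSpecifications)
  }
  invisible(TRUE)
}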

I'd also prefer to keep the connection details as a separate parameter so that we can consider both the analysis specification and the execution settings when performing an incremental run, as requested in #95. I'd like to avoid putting anything that holds sensitive information (like the connection details) in the execution settings, so we can serialize those settings and use them to determine when a module needs to run based on its inputs changing.
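To illustrate the idea (a sketch, not code from this branch): if the analysis specifications and execution settings are free of credentials, we can hash their serialized form and skip a module when nothing has changed.

# Sketch: hash the credential-free inputs to decide whether a module needs to re-run
moduleInputsChanged <- function(analysisSpecifications, executionSettings, checksumFile) {
  currentChecksum <- digest::digest(list(
    analysisSpecifications = analysisSpecifications,
    executionSettings = executionSettings
  ))
  if (file.exists(checksumFile) && readLines(checksumFile)[1] == currentChecksum) {
    return(FALSE)  # inputs unchanged; the module's prior results can be reused
  }
  writeLines(currentChecksum, checksumFile)
  return(TRUE)
}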

> R6 class inheritance: I think we're depending too heavily on a class structure where child classes have dependencies on behavior of the parent, making it more complicated to implement a module because you have to understand the details of the parent in order to fully implement the child. This can work, but only after very careful planning. In all my years, I have found that an effective class hierarchy emerges over time, or when there was a very specific plan for how the parent's behavior was to be extended by the child. I think the classic effective application of inheritance is the notion of an abstract base class, where you have the general process defined in the parent and the children 'fill in the blanks' by implementing the parent's abstract methods. However, that's not what's happening in the Strategus module design. I recommend we strip most (if not all except the core module functions) of the inheritance and instead just have a base class that provides the 'not implemented' functions for the 3 or 4 methods (execute, validate, upload, createResultsModel), and have the children responsible for 'filling in the blanks' of how to implement those operations. If we see that some of these functions depend on similar functions (such as: get the moduleSettings for this module from the analysisSettings), I'd rather have a utility function that allows reuse of the operation (and easier unit testing) than a parent function that is inherited by the child. To this point: it's not specifically a module function to extract module settings from an analysisSpec. I definitely want to avoid 'special parent state' that needs to be initialized in order for the child to function properly; that just leads to errors.

I chose to use an inheritance structure here on purpose since I want to mandate certain behavior of a StrategusModule. Most of what is in the base class at this point is: defining how a module is named, how its settings class is defined, validating inputs to the functions (which is where child classes can use super$ to access that behavior), and building the jobContext. Putting aside the jobContext, since I expect that will change, I think the inheritance structure works well because I'd like to make sure that each child module has some standard behavior defined at the parent level. I think the child classes are still "filling in the blanks", to your point, with the exception of the jobContext, which we can hopefully refactor out as we move forward. If we need to refactor utility functions out to expose them at the package level, I'm open to that as well, but for now I like the encapsulation of having private functions for the needs of the module.