OHDSI / Strategus

[Under development] An R packages for coordinating and executing analytics using HADES modules
https://ohdsi.github.io/Strategus/
6 stars 10 forks source link

Remove `sharedResources` from analysis specification #144

Open anthonysena opened 6 days ago

anthonysena commented 6 days ago

The v0.x structure of a Strategus analysis specification is of the form

{
   "sharedResources":[....],
   "moduleSpecifications": [...]
}

The original thinking here is that we might need to refer to different "shared resources" which has mainly been the cohort definitions (with subsets) and negative control outcomes. Under the hood, Strategus will take the sharedResources section of the analysis specification and the selected module specifications and pass that to the module for execution.

@chrisknoll and others have proposed removing this section since it is a bit of a "catch all" that lacks structure. I'd propose that we simply store the elements of the cohort definitions, subsets and negative control outcomes with the CohortGeneratorModule specifications. We can then pass the CohortGeneratorModule specifications & the specific module's specifications into the job context for the module so that, if needed, these elements can be accessed.

Full analysis specification example: https://github.com/ohdsi-studies/FluoroquinoloneAorticAneurysm/blob/master/inst/analysisSpecification.json

chrisknoll commented 6 days ago

Thanks, @anthonysena for putting this issue out there for discussion.

My feeling on the 'shared resources' element is that it's a bit of a redundant moniker on those elements of the Strategus JSON, where you could just have cohort definitions, negative control specs etc, just declared at the top-level of the strategus where all modules know that they exist. One complication/confusion we have today is that some people think about putting the design specification inside the module node of Strategus, while other things were found in 'shared'. My thought about structuring this would be:

root
| - CohortDefinitions
| - SubgroupDefinitions
| - CharacterizationAnalysis
| - IncidenceAnalysis
| - CohortMethodAnalysis
| - PredictionAnalysis
| - Modules

So under this structure, it's pretty clear what the elements of the Strategus json. In this way, all the design elements have a 'home'. The 'Modules' one is where we currently specify which modules are in play for the given analsyis (that's where you find the version and names of the modules), however, I think we're moving away from the 'versioned modules', correct? Meaning: when bulding a Strategus spec, you will be using the Strategus API based on a given version of the library (managed in either a Docker container or renv.lock file) which will produce the JSON payload. then when you exeucute, you just use the proper runtime context for this specification (again, either run within a docker or initilized from a renv.lock).

With the runtime context established, the process of executing the strategus analysis would be:

I also think we discussed separating the 'publish results' to a separate function, eithr as a separate function of Strategus, or depend on a completely external process for handing strategus output.

anthonysena commented 6 days ago

My feeling on the 'shared resources' element is that it's a bit of a redundant moniker on those elements of the Strategus JSON, where you could just have cohort definitions, negative control specs etc, just declared at the top-level of the strategus where all modules know that they exist.

I think we agree that the notion of sharedResources should be removed in favor of named elements as you have proposed. Where I am still a bit uneasy is the promotion of these elements (CohortDefinitions for example) to the top-level of the specification. Part of my uneasiness stems from the fact that we (you and I) presume cohort definitions mean Circe definitions. Additionally, the SubgroupDefinitions and NegativeControlOutcomeCohortDefinitions are settings/specifications that the CohortGenerator works with to construct the cohorts. So promoting these beyond the scope of CohortGenerator feels wrong (even though that's what we've done in sharedResources).

Perhaps the full analysis specification should be made available to each module so that it may be interrogated? If another module (i.e. CohortDiagnostics) needs info about the cohorts, it can interrogate the specification for the CohortGenerator module?

One complication/confusion we have today is that some people think about putting the design specification inside the module node of Strategus, while other things were found in 'shared'. My thought about structuring this would be:

For reference, here is how the design specification is nested inside of the module node as you described (in case anyone else reading this is wondering):

  "moduleSpecifications": [
    {
      "module": "CohortGeneratorModule",
      "version": "0.1.0",
      "remoteRepo": "github.com",
      "remoteUsername": "ohdsi",
      "settings": {
        "incremental": true,
        "generateStats": true
      },
      "attr_class": ["CohortGeneratorModuleSpecifications", "ModuleSpecifications"]
    }

Moving these to a top level of the design document would make this clearer to your point. I then presume you'd have only 1 specification for a module in a given analysis specification in this case? I think this is fair since a module can have its settings handle multiple analyses.

The 'Modules' one is where we currently specify which modules are in play for the given analysis (that's where you find the version and names of the modules), however, I think we're moving away from the 'versioned modules', correct?

That's right - the version of the "module" will be dictated by the version of Strategus that is declared in the renv.lock file and the corresponding module's package version.

With the runtime context established, the process of executing the strategus analysis would be:

Generate Cohorts If CharacterzationAnalsyis specified, run CharacterizationAnalysis ... repeat above check for each other type of analysis we want to run. maybe this list can be specified as an input parameter to the execution call (ie: execute(design, options) where options lets you control which analysis to run.

Generate Cohorts is a special case - we have to do that one up front and so agreed that it comes first. Also agreed about how we might loop through the additional elements of the analysis specification. Fully agreed about the function signature execute(design, options) where options lets you control which analysis to run. - this has been a pain in the v0.x line since when we want to execute part of an analysis, it required that you create another analysis specification which is not ideal.

I also think we discussed separating the 'publish results' to a separate function, either as a separate function of Strategus, or depend on a completely external process for handing strategus output.

I think this is out of scope for this topic (analysis specification changes) but yes, I'd expect that results management is different from the analysis specification and execution. This difference will result in different functions of Strategus for these areas of concern.