OHDSI / Strategus

[Under development] An R package for coordinating and executing analytics using HADES modules
https://ohdsi.github.io/Strategus/

How to design and distribute a Strategus network study #148

Open anthonysena opened 3 months ago

anthonysena commented 3 months ago

Adding this issue since it is related to a number of other issues that are part of the v1.0 milestone such as #98, #78, #29.

Per discussion with @schuemie: As discussed at the last Global Symposium, one idea is to combine the renv.lock file with the study analysis specifications into a single JSON object, and that renv.lock file can be anything.

We imagined several scenarios, including:

  1. Running Strategus in an execution engine. Here, the execution engine would extract the lock file from the specifications. If the lock file matches an existing container (e.g. because the lock file corresponds to a HADES-wide release), it would use that container. If not, it would copy a container and instantiate the lock file there before calling Strategus.
  2. If someone doesn’t have an execution engine, there would need to be a first step to instantiate the lock file.
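Scenario 1 could be sketched roughly as follows. This is a minimal illustration, assuming the execution engine keeps a lookup from lock-file hash to container image. The registry, the image name, and both helper functions are hypothetical and not part of Strategus or any existing execution engine.

```r
# Hypothetical sketch: pick a container image whose lock file matches the study's.
# The registry, image name, and helper names are illustrative assumptions.

# Hash a lock file's contents so byte-identical lock files compare equal.
hashLockFile <- function(path) {
  unname(tools::md5sum(path))
}

# Return the name of the matching image, or NA if no existing container matches
# (in which case a container would be copied and the lock file instantiated).
findMatchingContainer <- function(lockFilePath, registry) {
  lockHash <- hashLockFile(lockFilePath)
  hits <- names(registry)[registry == lockHash]
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Demo with two throwaway lock files.
releaseLock <- tempfile(fileext = ".lock")
customLock <- tempfile(fileext = ".lock")
writeLines('{"R": {"Version": "4.2.3"}, "Packages": {}}', releaseLock)
writeLines('{"R": {"Version": "4.3.1"}, "Packages": {}}', customLock)

# The registry maps image name -> hash of the lock file the image was built from.
registry <- c("hades-release-image" = hashLockFile(releaseLock))

matchedImage <- findMatchingContainer(releaseLock, registry)
unmatchedImage <- findMatchingContainer(customLock, registry)
```

Hashing the raw file is the simplest possible comparison. It treats two lock files that differ only in whitespace as different, so a real implementation might compare the parsed package versions instead.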

For now, we could just have the lock file and the specifications be separate (although they definitely belong together). Anyone wanting to run the study would use the lock file to instantiate their R environment, and at that point the right version of Strategus would be installed and ready to run the specifications.

This raises the question: how do we want to design and distribute a Strategus network study? If we look at previous studies that used Strategus (e.g. the anti-VEGF SOS Study), we are using an R Project to encapsulate the entire study. We've also discussed only requiring the analysis specification JSON as the means to exchange a full study amongst collaborators. Let's use this issue to discuss this important topic.

schuemie commented 3 months ago

As I mentioned, for now I would just keep the renv lock file and specifications JSON separate. A strawman proposal on how to distribute them in the short term:

  1. Put the renv lock file and specifications JSON in a repo in ohdsi-studies.
  2. In the README, add instructions on how to use these to run the study. For example:
    
    # Instantiate R environment:
    install.packages("renv")
    download.file("https://raw.githubusercontent.com/ohdsi-studies/<repo name>/main/renv.lock", "renv.lock")
    renv::init()

    # Run study:
    download.file("https://raw.githubusercontent.com/ohdsi-studies/<repo name>/main/specs.json", "specs.json")
    analysisSpecifications <- ParallelLogger::loadSettingsFromJson("specs.json")
    executionSettings <- Strategus::createCdmExecutionSettings(
      workDatabaseSchema = "",
      cdmDatabaseSchema = "",
      cohortTableNames = CohortGenerator::getCohortTableNames(cohortTable = ""),
      workFolder = "",
      resultsFolder = "",
      minCellCount = 5
    )

    connectionDetails <- DatabaseConnector::createConnectionDetails(...)
    Strategus::execute(
      analysisSpecifications = analysisSpecifications,
      executionSettings = executionSettings,
      connectionDetails = connectionDetails
    )


Alternatively, or additionally, we could have the repo contain an RStudio project with the renv lock file, specs JSON, and a single R script that instantiates the R environment and executes the study. People could clone the repo and modify and run the R script.

In the future we'd expect an execution engine to handle all of this.

anthonysena commented 3 months ago

Thanks @schuemie - I'd support keeping the renv.lock file and the analysis specification as separate documents since they serve two different functional purposes. Having the renv.lock file describe the configuration of the R environment for the study enables us to use renv as that package intends vs. trying to expose renv functionality inside of Strategus.

> Alternatively, or additionally, we could have the repo contain an RStudio project with the renv lock file, specs JSON, and a single R script that instantiates the R environment and executes the study. People could clone the repo and modify and run the R script.

I'd support using an RStudio project for distributing a Strategus study since it would also allow us to bundle together additional resources as mentioned in #98 and provide support for viewing results at a specific site per #78.

> In the future we'd expect an execution engine to handle all of this.

I'll tag @konstjar for his thoughts here. Arachne supports uploading a study .zip file that contains a script used to execute the study and supports supplying that script with parameters for execution. So I think that having either a script as you showed in your previous post or an R Project would work well to run via Arachne.

anthonysena commented 3 months ago

Bringing this over from #51 so that we can discuss it in the context of how we'd propose the design and distribution of a Strategus study.

Here is a proposed workflow, assuming that an OHDSI network study has been proposed, a protocol exists, etc., and that we've set up our R environment following the HADES R Setup instructions:

  1. Create a new R Project
  2. renv::init() to set up the renv code infrastructure
  3. Download the latest HADES-wide lock file for use in the project (assumption: Strategus is in the HADES-wide lock file)
  4. renv::restore() to ensure the HADES-wide dependencies are installed.
  5. Follow the Strategus vignette for creating an analysis specification and save the analysis specification in the R Project somewhere.
  6. Run some smoke tests to make sure it works (need to expand on how to do this but likely run study against Eunomia?).
  7. Upload to a study repo on https://github.com/ohdsi-studies

A potential pitfall with the workflow above is that the renv.lock file used to create the analysis specification could fall out of sync with the packages needed to run the study. @schuemie suggested here that we include the renv.lock file in the analysis specification so that we have a hard link between the lock file and specifications. Additionally, @chrisknoll expressed a desire for each release of Strategus to come with a published renv.lock file listing the versions of package dependencies that have been tested with that version of Strategus, noting that within a single release of Strategus there may be multiple updates to underlying packages.

So, if we adopt the ideas above (include renv.lock in the analysis spec and have an renv.lock file that comes with Strategus), what does a developer workflow look like to design a study using Strategus? Here's how I was thinking about it at the moment, sticking with the idea that we're still distributing an R project:

  1. Create a new R Project
  2. renv::init() to set up the renv code infrastructure
  3. Install Strategus.
  4. Call something like Strategus::copyRenvLockFile(destination = ".") to copy the renv.lock file from Strategus into the root of the project.
  5. renv::restore() to ensure the HADES-wide dependencies are installed, including Strategus.
  6. Follow the Strategus vignette for creating an analysis specification and save the analysis specification in the R Project somewhere. Presumably this process will then describe how to embed the renv.lock file into the analysis specification.
  7. Run some smoke tests to make sure it works (need to expand on how to do this but likely run study against Eunomia?).
  8. Upload to a study repo on https://github.com/ohdsi-studies
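Step 6 leaves open how the lock file would be embedded in the analysis specification. One possible shape, as a minimal sketch only: parse both JSON documents and store the lock file under a top-level key. The `renvLock` key and the two helper functions are assumptions made for illustration, and the specification content in the demo is a placeholder. A real implementation would presumably go through the same ParallelLogger serialization that Strategus uses for specifications rather than plain jsonlite.

```r
library(jsonlite)

# Hypothetical helper: store the parsed lock file under an assumed "renvLock" key.
embedLockFile <- function(specPath, lockPath, outPath) {
  spec <- fromJSON(specPath, simplifyVector = FALSE)
  spec$renvLock <- fromJSON(lockPath, simplifyVector = FALSE)
  write_json(spec, outPath, auto_unbox = TRUE, pretty = TRUE)
  invisible(outPath)
}

# Hypothetical helper: write the embedded lock file back out, e.g. to the
# project root, so that renv can restore from it.
extractLockFile <- function(specPath, lockPath) {
  spec <- fromJSON(specPath, simplifyVector = FALSE)
  if (is.null(spec$renvLock)) stop("No embedded lock file found")
  write_json(spec$renvLock, lockPath, auto_unbox = TRUE, pretty = TRUE)
  invisible(lockPath)
}

# Demo with placeholder content in temporary files.
specPath <- tempfile(fileext = ".json")
lockPath <- tempfile(fileext = ".lock")
combinedPath <- tempfile(fileext = ".json")
restoredLockPath <- tempfile(fileext = ".lock")
writeLines('{"sharedResources": [], "moduleSpecifications": []}', specPath)
writeLines(
  '{"R": {"Version": "4.2.3"}, "Packages": {"renv": {"Package": "renv", "Version": "1.0.3"}}}',
  lockPath
)

embedLockFile(specPath, lockPath, combinedPath)
extractLockFile(combinedPath, restoredLockPath)

# The extracted lock file should carry the same content as the original.
combined <- fromJSON(combinedPath, simplifyVector = FALSE)
original <- fromJSON(lockPath, simplifyVector = FALSE)
restored <- fromJSON(restoredLockPath, simplifyVector = FALSE)
```

If the root renv.lock were always generated from the specification with something like `extractLockFile()`, the specification would be the single source of truth and the two copies could not drift apart silently.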

What makes me uneasy about this approach is that if we need to change an R dependency, we'd have to update the renv.lock file in the root of the project AND the analysis specification. We'd potentially need methods inside of Strategus to keep the renv.lock file in the root of the project in sync with the one that ultimately winds up in the analysis specification, to check that the environment used to execute the study is consistent with what is declared in the analysis specification, etc.
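The second kind of check mentioned here (is the executing environment consistent with the declared one?) could look something like the sketch below. It assumes the standard renv.lock layout, with a top-level `Packages` object whose entries carry a `Version` field. `findLockMismatches()` is a hypothetical helper, not an existing Strategus or renv function, and the package names in the demo are made up.

```r
library(jsonlite)

# Hypothetical helper: list packages whose installed version differs from the
# version declared in the lock file, or which are not installed at all.
# `installed` is a named character vector: package name -> installed version.
findLockMismatches <- function(lockPath, installed) {
  lock <- fromJSON(lockPath, simplifyVector = FALSE)
  locked <- vapply(lock$Packages, function(pkg) pkg$Version, character(1))
  actual <- unname(installed[names(locked)])
  differs <- is.na(actual) | actual != locked
  data.frame(
    package = names(locked)[differs],
    locked = unname(locked)[differs],
    installed = actual[differs],
    stringsAsFactors = FALSE
  )
}

# Demo with a throwaway lock file and a made-up set of installed packages.
lockPath <- tempfile(fileext = ".lock")
writeLines('{"R": {"Version": "4.2.3"}, "Packages": {
  "pkgA": {"Package": "pkgA", "Version": "1.0.0"},
  "pkgB": {"Package": "pkgB", "Version": "2.1.0"},
  "pkgC": {"Package": "pkgC", "Version": "0.3.0"}}}', lockPath)
installed <- c(pkgA = "1.0.0", pkgB = "2.0.0")

# pkgB has the wrong version and pkgC is missing.
mismatches <- findLockMismatches(lockPath, installed)
```

In a real session, `installed` could come from `installed.packages()[, "Version"]`. For a project that is already managed by renv, `renv::status()` reports this kind of drift directly, which is one argument for relying on renv rather than re-implementing the check inside Strategus.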

I fully agree that we need to have a hard link between the lock file and specifications. Is that not what the R project is providing in this case since it is including the renv.lock file?

Also tagging @mdlavallee92 as I think setting up the Strategus development environment was a topic in Ulysses here: https://github.com/OHDSI/Ulysses/issues/17. We'd need to decide on how we want developers to design a study using Strategus and where needed we can use Ulysses to help with the setup.

ablack3 commented 3 months ago

I'll add my two cents based on my limited experience in Darwin so far. There seem to be three layers that we might need to change between studies: parameters, code, environment.

For some things like cohort diagnostics we want to run the same code in the same environment over and over but with different parameters (cohort definitions). In all studies I've seen so far we do need to change the R code that is executed and cannot simply swap out parameters to get the required results. We are also changing the environment between studies. This might not be strictly necessary for every study, but it happens because each data scientist creates a lockfile based on the packages they are currently using. The flexibility to add custom code or slightly modify a mostly standardized study seems to be quite important.

Here is the process I'm thinking will work best at the moment:

I'm not sure if this adds much to the Strategus discussion since Strategus is, in my understanding, supporting the use case where we only change parameters between studies, which is the goal for scaling up standardized analytics. But I still wanted to share how I'm currently thinking the process will work best in Darwin, at least for off-the-shelf studies that are using Arachne/execution engine.

schuemie commented 3 months ago

As discussed yesterday, perhaps we could create a StrategusBootstrap package (might come up with a cooler name). This would be a very light package, with only a dependency on renv. It could have the following functionality:

  1. Contain one or more renv lock files that can, through a function call, be written to the project root and then instantiated using renv::restore().
  2. Ability to consume an renv lock file somehow bundled with the analysis specifications (either together in a single repo, or together in a single JSON file), write it to the project root, and call renv::restore().

The first is aimed at the time when we start to design our study. The user might do something like:

    install.packages("StrategusBootstrap")
    StrategusBootstrap::createStrategusEnvironment()

After this, Strategus and its dependencies will be installed and ready to start defining the study.

The second is aimed at network study execution. The user might do something like:

    install.packages("StrategusBootstrap")
    StrategusBootstrap::createStrategusEnvironment("ohdsi-studies/SemaglutideNaion")

After this, Strategus and its dependencies are installed and ready to execute the study.

Both functionalities could write some additional R file(s) that the user can use for their respective task (designing the study or executing the study).
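To make the proposal a bit more concrete, here is a rough sketch of what the core of such a bootstrap package might do. Everything in it is hypothetical: StrategusBootstrap does not exist, and the function names, arguments, and the assumption that a study repo keeps `renv.lock` at its root on the `main` branch are made up for illustration.

```r
# Hypothetical sketch of the bootstrap idea; none of this is an existing API.

# Build the raw GitHub URL of a study's lock file, assuming (for illustration)
# that it sits at the repo root on the given branch.
studyLockFileUrl <- function(repo, branch = "main") {
  sprintf("https://raw.githubusercontent.com/%s/%s/renv.lock", repo, branch)
}

# Write a lock file (a local path or a URL) to the project root and, if
# requested, instantiate it with renv::restore().
writeLockFileToProject <- function(lockSource, projectRoot = ".", restore = FALSE) {
  target <- file.path(projectRoot, "renv.lock")
  if (grepl("^https?://", lockSource)) {
    download.file(lockSource, target, quiet = TRUE)
  } else {
    file.copy(lockSource, target, overwrite = TRUE)
  }
  if (restore) {
    renv::restore(project = projectRoot, prompt = FALSE)
  }
  invisible(target)
}

# Demo of the "design" case: a lock file bundled with the package is copied
# into a fresh project. restore stays FALSE so nothing gets installed here.
projectRoot <- tempfile("study")
dir.create(projectRoot)
bundledLock <- tempfile(fileext = ".lock")
writeLines('{"R": {"Version": "4.2.3"}, "Packages": {}}', bundledLock)
lockPath <- writeLockFileToProject(bundledLock, projectRoot)

# For the "execution" case, the lock file source would be the study repo.
url <- studyLockFileUrl("ohdsi-studies/SemaglutideNaion")
```

Keeping the download/copy step separate from `renv::restore()` means the only runtime dependency is renv itself, in line with the "very light package" goal.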

anthonysena commented 1 month ago

I'm going to remove this from the v1.0 milestone and will leave this issue open since we have not fully addressed all of the points from this discussion.

For the v1.0 release, this will be documented and tracked via https://github.com/ohdsi-studies/StrategusStudyRepoTemplate/issues/4.