OHDSI / Strategus

[Under development] An R package for coordinating and executing analytics using HADES modules
https://ohdsi.github.io/Strategus/

Add linkability to design documents to implement JSON Hypertext Application Language (HAL) #98

Open azimov opened 10 months ago

azimov commented 10 months ago

Strategus design documents are currently isolated resources but there are a number of limitations with this approach in terms of meta-data linkability.

For example:

If I have a study design document, I want to link it to the protocol for the study and to any phenotype algorithms used in the study. If I'm searching for studies, I want to be able to find them based on drug/disease areas of interest or on related interests. Other investigators may also wish to be aware of studies executed in related areas, and linkability would greatly aid that.

My proposal is that we adopt JSON HAL.

This approach has two classes for consideration: Link and Resource.

Any resource can also, optionally, include a link.

Caching

The embedded object is important in our context because any linked resource can also be embedded in the document. This means we could support both of the following definitions:

"cohortDefinitions": {
   "_links": {
       "cohort1": {"href": "https://phenotype_library.com/phenotype_id"},
       "cohort2": {"href": "https://phenotype_library.com/phenotype_id"}
   }
}

AND

"cohortDefinitions": {
   "_links": {
       "cohort1": {"href": "https://phenotype_library.com/phenotype_id"},
       "cohort2": {"href": "https://phenotype_library.com/phenotype_id"}
   },
   "_embedded": {
       ... < cohort definitions>
   }
}

This is referred to as the "hypertext cache pattern" and would allow us to share payloads that include all external resources (which is strongly desired for passing studies around) but could start to create maintainability and auditability headaches.
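To make the pattern concrete, here is a minimal Python sketch of how a consumer might read such a section: use the embedded copy when one is cached, and fall back to the link otherwise. The function name `resolve_cohort`, the convention that `_embedded` is keyed by the same names as `_links`, and the body of the embedded cohort are all assumptions for illustration, not anything Strategus does today.

```python
import json


def resolve_cohort(section, name):
    """Return ("embedded", definition) when a cached copy exists,
    otherwise ("link", href) so the caller can decide whether to fetch it."""
    embedded = section.get("_embedded", {})
    if name in embedded:
        return "embedded", embedded[name]
    link = section.get("_links", {}).get(name)
    if link is None:
        raise KeyError(f"no link or embedded copy for {name!r}")
    return "link", link["href"]


# Hypothetical fragment; the URL is the placeholder used in the example above,
# and the embedded cohort body is made up.
doc = json.loads("""
{
  "cohortDefinitions": {
    "_links": {
      "cohort1": {"href": "https://phenotype_library.com/phenotype_id"},
      "cohort2": {"href": "https://phenotype_library.com/phenotype_id"}
    },
    "_embedded": {
      "cohort1": {"name": "hypothetical cohort"}
    }
  }
}
""")

section = doc["cohortDefinitions"]
cohort1 = resolve_cohort(section, "cohort1")  # served from the embedded cache
cohort2 = resolve_cohort(section, "cohort2")  # only a link is available
```

A runner in an air-gapped environment could treat the "link" outcome as an error, while a connected one could fetch the resource.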

Security note

The HAL design should never include executable content. This includes the embedded JSON that we currently use for cohort definitions. Though the risk is low, it would potentially be exploitable.

schuemie commented 7 months ago

A crucial part of the Strategus specs is that they're self-contained, for at least two reasons:

  1. For reproducibility: I want to be able to run a study even when external libraries may have changed.
  2. For air-gapped environments, where I don't have access to external libraries.

I'm all for including meta-data in Strategus specifications that allows you to trace where cohort definitions etc. came from. But using HAL seems to turn this around: the external link is required, but the embedding is optional?

Also, where would the URLs come from? In your example you made up a "phenotype_library.com", but what would be a real example of a URL in OHDSI? How would it work if I for example design a study inside my organization, and would like to run it as an OHDSI network study?

azimov commented 7 months ago

> A crucial part of the Strategus specs is that they're self-contained, for at least two reasons:
>
>   1. For reproducibility: I want to be able to run a study even when external libraries may have changed.
>   2. For air-gapped environments, where I don't have access to external libraries.

The files can be either embedded or linked. A resource can (and should) have multiple links. We can propose a practice of giving each resource a URL for its original source (if it is in the phenotype library) and a local path relative to the document.
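As a rough sketch of that practice: HAL lets a link relation hold either a single link object or an array of them, and link objects have an optional `name` property that can distinguish entries sharing a relation. The names "source" and "local", the path `cohorts/cohort1.json`, and the helper `pick_link` are assumptions for illustration only.

```python
# Hypothetical resource with two links for the same cohort:
# its original source and a copy stored relative to the document.
cohort_links = {
    "_links": {
        "cohort1": [
            {"name": "source", "href": "https://phenotype_library.com/phenotype_id"},
            {"name": "local", "href": "cohorts/cohort1.json"},
        ]
    }
}


def pick_link(resource, rel, preferred=("local", "source")):
    """Return the href of the first preferred named link for a relation."""
    links = resource["_links"][rel]
    if isinstance(links, dict):  # a single link object rather than an array
        links = [links]
    by_name = {link.get("name"): link for link in links}
    for name in preferred:
        if name in by_name:
            return by_name[name]["href"]
    return links[0]["href"]  # no preferred name found: use the first link
```

Preferring "local" keeps execution independent of the external library, while the "source" link still records where the cohort came from.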

> Also, where would the URLs come from? In your example you made up a "phenotype_library.com", but what would be a real example of a URL in OHDSI? How would it work if I for example design a study inside my organization, and would like to run it as an OHDSI network study?

The path can be any URI, and best practice for us would be a relative path. To me it seems preferable to have a tarball for a study, as opposed to a single document, because when you get to hundreds or thousands of cohorts you can actually audit them. The use of "_embedded" gives us the flexibility to cache the resources inside the document (and also adds an extra form of validation at run time to see if the resources are present).
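The run-time check could look something like the Python sketch below, which reports every linked resource that is neither embedded in the document nor present as a file at a relative path inside the study directory. The function name `missing_resources`, the directory layout, and the example file names are hypothetical.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def missing_resources(section, study_dir):
    """List link names with neither an embedded copy nor a local file."""
    embedded = section.get("_embedded", {})
    missing = []
    for name, link in section.get("_links", {}).items():
        if name in embedded:
            continue  # cached inside the document
        href = link["href"]
        parsed = urlparse(href)
        is_relative = (
            not parsed.scheme and not parsed.netloc and not href.startswith("/")
        )
        if is_relative and (Path(study_dir) / href).is_file():
            continue  # shipped alongside the document
        missing.append(name)
    return missing


# Hypothetical study directory containing a single cohort file.
with tempfile.TemporaryDirectory() as study_dir:
    cohort_dir = Path(study_dir) / "cohorts"
    cohort_dir.mkdir()
    (cohort_dir / "cohort1.json").write_text("{}")

    section = {
        "_links": {
            "cohort1": {"href": "cohorts/cohort1.json"},
            "cohort2": {"href": "https://phenotype_library.com/phenotype_id"},
            "cohort3": {"href": "https://phenotype_library.com/phenotype_id"},
        },
        "_embedded": {"cohort3": {"name": "hypothetical cohort"}},
    }
    result = missing_resources(section, study_dir)  # only cohort2 is unresolved
```

An empty result would mean the study can run without reaching any external library.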

schuemie commented 7 months ago

I know all of this is very subjective, but I think there's a lot of benefit to the simplicity of one study - one JSON file.

Changing that to a tarball with internal relative path linkages adds a lot of complexity, for no obvious gain. It also doesn't serve the purpose of documenting where the artifacts (e.g. cohorts) came from, which I thought was the reason you proposed HAL.