OHDSI / WebAPI

OHDSI WebAPI contains all OHDSI services that can be called from OHDSI applications

Apache License 2.0

125 stars 165 forks source link

Estimation & Patient Level Prediction Specification Editors #568

Closed anthonysena closed 5 years ago

anthonysena commented 6 years ago

Overview

ATLAS & WebAPI provide the ability to design population level effect estimation (PLE) and patient level prediction (PLP) studies. The current set of capabilities in ATLAS are limited in several respects:

The ATLAS designer does not expose all of the options available in each of the underlying R methods libraries: CohortMethod in the case of PLE and PatientLevelPrediction in the case of PLP.
The ATLAS designer is limited to a single comparison (Target/Comparator/Outcome) for PLE where the CohortMethod package supports designing for multiple comparisons with multiple outcomes.
The ATLAS designer supports a single analysis specification whereas CohortMethod supports designing with multiple analysis specifications

There are other gaps but the idea is here is that we'll close these gaps and provide editors that support the full range of options available in the packages.

High-Level Design

Study Specification Details

The object models that provide the details for how Estimation/Prediction studies will be specified have been detailed in the OpenAPI 3.0 format and are currently in a branch in the OHDSI Specifications repository: https://github.com/OHDSI/Specifications/tree/init. The specifications are also posted on SwaggerHub whicih makes viewing/navigating through them a bit easier:

Estimation: https://app.swaggerhub.com/apis/anthonysena5/Estimation/0.9.0 Prediction: https://app.swaggerhub.com/apis/anthonysena5/Prediction/0.9.0

The shared dependencies for Estimation & Prediction have been externalized into their own specification documents:

ConceptSet: https://app.swaggerhub.com/apis/anthonysena5/ConceptSet/0.9.0 CohortDefinition: https://app.swaggerhub.com/apis/anthonysena5/CohortDefinition/0.9.0 Cyclops: https://app.swaggerhub.com/apis/anthonysena5/Cyclops/0.9.0 FeatureExtraction: https://app.swaggerhub.com/apis/anthonysena5/FeatureExtraction/0.9.0

While OpenAPI 3.0 supports detailing the REST endpoints, I've chosen to do that as part of the WebAPI work.

These objects will be concretely defined in Java and stored in the StandardizedAnalayisAPI repository: https://github.com/OHDSI/StandardizedAnalysisAPI. These objects, in turn, will be utilized in WebAPI to facilitate the CRUD type operations.

Hydra

The aim of the Estimation & Prediction specifications above is to capture and encapsulate all of the study design choices and dependencies needed for study execution. The next step is to create an executable unit of code to run the study against an OMOP CDM. This will be facilitated by the Hydra component:

https://github.com/OHDSI/Hydra

Hydra will hydrate a package skeleton into executable R study packages based on the specifications above in JSON format. A study "skeleton" is a generic R Package template designed to utilize the OHDSI R methods libraries for study execution.

Hydra will be an R package and Java library which will allow us to reference the Hydra .JAR file as a dependency in WebAPI for creating the study package via a REST endpiont.

Workflow

The following is the proposed workflow for designing and executing studies.

Design PLE/PLP Study In ATLAS

Design your study using the revised editors in ATLAS. From the interface, users will have the ability to export the full study specification in JSON format that complies with the specifications referenced earlier. Furthermore, they can download a full study package generated by Hydra to execute the study.

Execute the study

The R package produced via ATLAS/WebAPI/Hydra will provide users with a unit of executable code for their OMOP CDM(s). Furthermore, the R package provides a fully encapsulated study that is transportable for network studies. The package will be executable through both Arachne or R Studio.

anthonysena commented 6 years ago

Object Model

The following diagram depicts the proposed new tables for storing an estimation specification into to WebAPI:

Note: The diagram above has been revised to reflect discussion and decisions made in this thread.

It should be noted that a similar set of tables will be created storage for PatientLevelPrediction studies. I'm excluding those for now so that we can hash out any design discussions on this set of tables and carry that design thinking forward to the PLP tables. A few notes on the tables in the model:

estimation - this table will be in addition to the current "cca" table - we'll look to archive off or drop the 'cca' tables in the next major version of WebAPI. The Estimation.type field will hold the type of estimation study specified. In this initial release, this field will hold "ComparativeCohortAnalysis" and in subsequent releases this will allow us to support the other types of estimation studies currently supported by the R Methods Libraries (i.e. Case Control, Case Crossover, Self-Controlled Cohort, etc).
estimation_cohort & estimation_concept_set: These respective tables will associate an estimation analysis with the cohorts and concepts sets in the WebAPI DB (see FK relationships to existing cohort_definition and concept_set tables).

pbr6cornell commented 6 years ago

Great work Sena and team! This is going to be an important step in the right direction to get to the point where we can design and execute studies in a consistent and reproducible process that is fully transportable across the OHDSI network.

At a high-level, I like the direction and am very encouraged by the progress.

A few low-level comments that are probably in the weeds, but I'll document there here before I forget:

I'm confused what a 'estimation concept set' would be. We use conceptsets in our cohort definitions. We also use conceptsets for specific inputs to estimation studies, such as covariates to include/exclude or as a list of negative controls, but these each seem like quite distinct functions. I suspect I'm just not understanding the intent here.
When we talk about 'negative controls', I think it would be wise to clearly differentiate negative control outcomes and negative control exposures, and not make any presumptions about which would be used for any particular analysis (with a vision toward a possible future where an analysis may involve BOTH negative control outcomes AND negative control exposures). As a general construct, I would think a negative control for a comparative cohort analysis is a T/C/O tuple for which we believe the true comparative effect (RR) = 1. If the T/C are the same as the question of interest but the O differs, then its a negative control outcome...If the O is the same as the question of interest, but the T/C differs, its a negative control exposure. Perhaps a negative control object should simply store: T, C, O, control type (= outcome or exposure). Alternatively, if it is just a T/C/O tuple, then the 'control type' can be inferred, but that puts a hard reliance on alignment of the cohortIds from the negative control list and the analysis of interest.
I may have simply missed it in the spec, but it wasn't clear how/if we are accommodating empirical calibration of CI through synthetic positive controls (derived using the negative control sets).

On Wed, Aug 22, 2018 at 11:14 AM, Anthony Sena notifications@github.com wrote:

Overview

ATLAS & WebAPI provide the ability to design population level effect estimation (PLE) and patient level prediction (PLP) studies. The current set of capabilities in ATLAS are limited in several respects:

The ATLAS designer does not expose all of the options available in each of the underlying R methods libraries: CohortMethod in the case of PLE and PatientLevelPrediction in the case of PLP.

The ATLAS designer is limited to a single comparison (Target/Comparator/Outcome) for PLE where the CohortMethod package supports designing for multiple comparisons with multiple outcomes.

The ATLAS designer supports a single analysis specification whereas CohortMethod supports designing with multiple analysis specifications

There are other gaps but the idea is here is that we'll close these gaps and provide editors that support the full range of options available in the packages. High-Level Design Study Specification Details

The object models that provide the details for how Estimation/Prediction studies will be specified have been detailed in the OpenAPI 3.0 format and are currently in a branch in the OHDSI Specifications repository: https://github.com/OHDSI/Specifications/tree/init. The specifications are also posted on SwaggerHub whicih makes viewing/navigating through them a bit easier:

Estimation: https://app.swaggerhub.com/apis/anthonysena5/Estimation/0.9.0 Prediction: https://app.swaggerhub.com/apis/anthonysena5/Prediction/0.9.0

The shared dependencies for Estimation & Prediction have been externalized into their own specification documents:

ConceptSet: https://app.swaggerhub.com/apis/anthonysena5/ConceptSet/0.9.0 CohortDefinition: https://app.swaggerhub.com/apis/anthonysena5/ CohortDefinition/0.9.0 Cyclops: https://app.swaggerhub.com/apis/anthonysena5/Cyclops/0.9.0 FeatureExtraction: https://app.swaggerhub.com/apis/anthonysena5/ FeatureExtraction/0.9.0

While OpenAPI 3.0 supports detailing the REST endpoints, I've chosen to do that as part of the WebAPI work.

These objects will be concretely defined in Java and stored in the StandardizedAnalayisAPI repository: https://github.com/OHDSI/ StandardizedAnalysisAPI. These objects, in turn, will be utilized in WebAPI to facilitate the CRUD type operations. Hydra

The aim of the Estimation & Prediction specifications above is to capture and encapsulate all of the study design choices and dependencies needed for study execution. The next step is to create an executable unit of code to run the study against an OMOP CDM. This will be facilitated by the Hydra component:

https://github.com/OHDSI/Hydra

Hydra will hydrate a package skeleton into executable R study packages based on the specifications above in JSON format. A study "skeleton" is a generic R Package template designed to utilize the OHDSI R methods libraries for study execution.

Hydra will be an R package and Java library which will allow us to reference the Hydra .JAR file as a dependency in WebAPI for creating the study package via a REST endpiont. Workflow

The following is the proposed workflow for designing and executing studies. Design PLE/PLP Study In ATLAS

Design your study using the revised editors in ATLAS. From the interface, users will have the ability to export the full study specification in JSON format that complies with the specifications referenced earlier. Furthermore, they can download a full study package generated by Hydra to execute the study. Execute the study

The R package produced via ATLAS/WebAPI/Hydra will provide users with a unit of executable code for their OMOP CDM(s). Furthermore, the R package provides a fully encapsulated study that is transportable for network studies. The package will be executable through both Arachne or R Studio.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/OHDSI/WebAPI/issues/568, or mute the thread https://github.com/notifications/unsubscribe-auth/AAsrGoLXecL9LkCEDyXkn7huTKa0oYQpks5uTXVYgaJpZM4WH3ZU .

anthonysena commented 6 years ago

WebAPI REST Interface

The following diagram depicts the public interface for the proposed new WebAPI service org.ohdsi.webapi.service.Estimation.java:

Hoping that most of these are pretty self-explanatory "CRUD" operations; for brevity I've omitted protected/private methods for now. A small note about one of the methods that may may be unclear:

exportEstimation(id, target): ID is the identifier for the EstimationAnalysis and the target will provide a way to specify if the export target is JSON (default) or Package (generated via Hydra). I may need to split these into separate REST endpoints based on the response content type but conceptually, that's what this method (or sets of methods) will do.

anthonysena commented 6 years ago

Negative Controls

The following sections will detail out how negative controls are modelled in the specification and how they will be used in performing empirical calibration. Tagging @schuemie for comments and/or corrections.

Negative Control

The negative control class defines as a tuple: target (T), comparator (C), outcome (O). The outcome identifier in this case is a concept_id that represents the exposure/outcome of interest. This will allow us to separate out the negative controls from the comparisons of interest in the study which are modeled using the TargetComparatorOutcomes class. TargetComparatorOutcomes is similar in structure to the NegativeControl class but instead provides an array a set of outcomes of interest along with arrays for covariates for inclusion/exclusion.

Negative Control Cohort Expression

A negative control cohort expression is a new construct aimed at generating a SQL query to generate negative control cohorts:

This expression type will be added to the circe-be package to support the options described in the screenshot above. The NC expression will utilize the outcome concept ids specified in the EstimationAnalyais.negativeControls array to generate a set of negative controls cohorts where the cohort_id will be the negative control concept_id. In an attempt to address Patrick's question above: I believe that we can distinguish between a negative control exposure versus outcome based on the domain(s) array that is part of the expression. Please let me know if you have concerns with this approach.

anthonysena commented 6 years ago

Concept Set Usage

Here's my thinking which is still wavering a bit on this topic. CohortMethod does not utilize concept sets; instead it utilizes lists of concepts (or covariate Ids) as part of the specification. In ATLAS, we create lists of concepts using the concept set construct. As such, we can think of a list of concept IDs as a concept set for the purposes of working with them in Atlas.

When we utilize the estimation import/export functionality, we can translate to/from concept sets to arrays of concept IDs. My thinking is this: when defining an element of an estimation analysis that requires a list of concept IDs through the Atlas UI, we'll work in terms of concept sets. When we export the estimation specification, we will resolve this concept set against the vocabulary to produce a list of concept ID's which are embedded into the specification. When importing an existing specification, we can construct a concept set based on the concept IDs in the specification and establish a reference to the concept set in the repository.

I have a place for storing concept sets as part of the specification (EstimationAnalysis.conceptSets) but this may be unnecessary since we'll be translating these concept set expressions to list of concept IDs. My thinking here is that we'd like to have the concept set expression along with the concept IDs that were used in the design. However, this might be an unnecessary artifact that clutters the specification. Appreciate any/all feedback on this topic.

anthonysena commented 6 years ago

Inject Signal Args

This class in the estimation specification is designed to accommodate empirical calibration of CI through synthetic positive controls (derived using the negative control sets). In reviewing the specification, we allow for an explicit set of target (T) and (negative control) outcomes ({O}). @pbr6cornell please let me know if this helps to address your questions above.

pavgra commented 6 years ago

@anthonysena, thank you for agreeing to move the design into public! Your spec is really comprehensive and gives a deep dive into what's going to be implemented - well done.

After reading all your posts, I have a couple of questions/notes:

Packaging approach

We've already started a talk about the Hydra and I really cannot answer myself why it's required to generate the full skeleton and provide a user with a bunch of files. This has a couple of disadvantages from my perspective:

A user needs to wait for the folder with R files and resources to be generated;
The folder requires certain place for storage;
(!) The generated folders for multiple analyses most likely going to differ just in minor things (the settings defined in Atlas), but you'll still have to put the whole set of files e.g. in Git to share functioning code with others;
The package in your way won't guarantee reproducibility - it relies on external deps;

What I'd like to propose instead:

Have a package that accepts the PLE / PLP design JSON as input and executes the analysis;
Do proper design of the package so that its inner methods are well granularized. In this way, if a user wants to change any behavior, he/she could just overwrite any method (logical part) and store only the diff (overridden parts) with the original package, not all the code;
If we have two points above, then only a JSON with overridden methods (if such exist) should be put into Git, versus a ton of files in case of package generation;
Moreover, I'd suggest looking at Packrat to resolve the changing deps issue and provide 100% reproducibility;

@schuemie , would be grateful for your thoughts on the bullets above.

StandardizedAnalysisAPI

Since we've already started standardization of analyses specifications, I'd ask you to support the initiative and continue work on your interfaces in https://github.com/OHDSI/StandardizedAnalysisAPI repo. What is more, some of the entities which you need for the PLE / PLP spec, I believe, already available there.

Versioning

Why do you have expression field in estimation_cohort? I feel that you want to create a copy of the used cohorts design, but I was confident in that we agreed to use references and work on overall versioning topic separatly. To have it well-designed and implement consistenlty across whole app.

Other

Can there be multiple Estimation Specs per one estimation from a logical perspective? Wondering why it was moved to separate table. In case of versioning, refer to section above.

anthonysena commented 6 years ago

@pavgra - thanks for taking the time to review the design and for the feedback/questions above. I'll attempt to address them here:

Packaging Approach

I'll let @schuemie comment on this approach - I think both approaches have merit. The overarching goal here is to make a unit of code that fully encapsulates the study to support transparency and to support transportability for network studies in the OHDSI data network.

StandardizedAnalysisAPI

I've pushed a branch to that repository and linked it to this issue. I have a few questions which we should review separate from this discussion. In summary: I plan to define interfaces required for estimation/plp in the StandardizedAnalysisAPI repository. They will be modeled on the OpenAPI specification shared above.

Versioning

In the "Object Model" post above, I outlined 3 functional needs for storing the 'expression' in the cohort/concept set tables respectively and this is what drove me to storing a copy of the cohort/conceptset expressions. As a general principle, I'm all for referencing objects versus creating copies. If you have an alternative design proposal, I'm happy to consider it.

Other

There is a 1:1 relationship between an estimation study and the specification and in theory these 2 tables should be combined into 1. That said, the design I put forward was modeled after the cohort_definition and cohort_definition_details tables in WebAPI. I believe there were practical reasons to separate these into 2 tables: when retrieving the list of cohorts, we didn't need the expression field (not sure if JPA makes it easy or not to exclude a field from an entity conditionally?). Also, the expression field is a varchar(max) and so there may have been some consdierations around table size/performance which were part of the rationale for breaking this into 2 tables. Tagging @chrisknoll for background on that design decision but on the surface, estimation & PLP will have large JSON specifications and I thought it best to follow a pattern of storage that already exists in the platform.

pavgra commented 6 years ago

@anthonysena, regarding the versioning:

When a cohort/concept set is utilized in an estimation study, we'd like to have a 'snapshot' of the definition that does not change. This is not an attempt to address versioning - it is a simplistic approach to ensure that we don't lose a dependency if someone decides to remove a concept set/cohort definition from the system.

Soft deleting is a better approach for preserving access to deleted dependencies as for me.

When we export an estimation design for to JSON, we'll utilize the expression in this table not rely on the referenced cohort/concept set in the repository.

But this point is clearly related to the versioning topic, isn't it?

Down the line we can utilize this expression to detect changes between the cohort/concept set copy kept for estimation and the referenced cohort/concept set in the repository.

The Cohort hash calculation was implemeted during work on CC and is going to be available soon. That's a more proper way to do the detection of changes and would be great to reuse it.

When we import an estimation design, we'll utilize the expression field to store the cohort/concept set expression. Furthermore, we will create the cohort/concept set in the main repository and then establish the linkage as defined in the respective tables. In this way, the estimation specification definition is preserved while enabling the usage of the cohort/concept set with the other ATLAS functions available to these entities.

Again, since we've implemented storage of Cohorts actual hashes, we were also able to provide a way not to create a new copy of a cohort on each import, but to do a lookup of whether an imported cohort already presented in DB (hash of new cohort name + expression equals to hash of any of existing cohorts) and use the link to existing one. The same here - would be great to re-use the CC approach in new PLE / PLP.

pavgra commented 6 years ago

@anthonysena , @chrisknoll , regarding the details tables I strongly believe that proper DB design is a first order thing. And if there cannot be any relations except for 1:1 between the main table and details table then those should be collapsed into a one. Speaking of performance concerns, not only it's possible to have JPA method returning only necessary data (so not attempting to read and return the design column) for the List all case, but also server-side pagination should be not forgotten there.

anthonysena commented 6 years ago

@pavgra - thanks again for your thoughts on the design. In my mind, storing the expression as part of the cohort/concept set is a relatively low priority in the overall scope of the work here. So I could exclude it from the design for now and we could think about introducing a hash or some other way to fulfill this functional need down the line. To further restrict the scope in this regard: we'll solely store references to the cohort definition/concept set IDs and at the time of export, we'll grab the expressions. When importing, we'll create the references to the cohort definitions and import the expressions. We can then evaluate the use of cohort hashes when that is formally in the code base. I'd also favor performing soft deletes on entities versus actual removal from the DB but that is a change that has a large impact and I would not recommend attempting to do that as part of the Symposium release.

I'll look to update the post with the DB entity diagram to reflect these changes.

chrisknoll commented 6 years ago

Hi, @pavgra : some background on splitting the 1:1 relationships to a separate table. This idea orignated many years ago when certain database platforms had 'maximum column width' constraints on table slzes, as well as the potential issue of access performance on tables that have very wide columns (like TEXT or VARCHAR(MAX)). At the time, the solution seemed to be put the main 'summary' columns into a top-level table, and any supporting details (like the cohort expresison) get offloaded to a secondary table which would have differnt storage dynamics due to the storage of LOB/CLOB/Varchar(MAX) columns. So, that was the rationale. If you think that reasoning doesn't hold in today's world of infrastucture, we can put all those 1:1 entities into a single table.

I'd like to press you on the notion of 'proper DB Design' tho: if it's only proper to have 1:1 elements contained in the same table, then why does JPA support splitting 1:1 and referencing those elements via @joinTable annotations?

pavgra commented 6 years ago

Here is a summary of a call w/ @anthonysena and @chrisknoll:

Entity model

Combine estimation and estimation_spec entities into a single table
Drop specification_version from estimation entity

Versioning

Remove expression field form estimation_cohort and estimation_concept_set entity -> store only references to dependencies (we'll go back to real versioning later)

StandardizedAnalysisAPI

@pavgra is going to wrap the StandardizedAnalysisAPI interfaces for PLE / PLP based on @anthonysena branch, OpenAPI spec and posted issue

UI

We are going to use the following rule for components: if a component is specific for some pages section (so not used anywhere else), it should be put under the pages section, and not into the top-level components folder. But if the component is required by several pages sections, then it goes into top-level components folder.

Reuse between Cohort Characterization, TxPathways and PLE/PLP

Calculation and storage of Cohort hashes, which is being implemented in CC, is going to be used in PLE/PLP for changes detection
Some of UI components can also be shared: https://github.com/OHDSI/Atlas/tree/master/js/components/cohort

Hydra

@anthonysena is going to request feedback from @schuemie on the thoughts posted above.

To clarify, typical structure for study repository in my suggested approach can be following:

/MyStudyName
/MyStudyName/definitiom.json (JSON produced by Atlas)
/MyStudyName/README.md (description of the study)
/MyStudyName/run_analysis.R (entry point script)

Whereas the entry point script kicks off execution of the study and allows to override default logic/behavior if necessary:

library(Hydra)

myStudy <- Hydra::loadStudy(definition.json)

// Override any method if required to provide custom logic 
myStudy.createCohorts <- function() {
 // some code
}

myStudy.run()

Thanks @chrisknoll for help on the example above.

t-abdul-basser commented 6 years ago

@anthonysena Thanks for the original proposal. Great work!

chrisknoll commented 6 years ago

I would like to add my support to the notion of a R package written to accept a study specification (as a JSON) and perform the execution. We're moving away from storing the generated SQL inside studies and instead leverage a circe-R-package that will generate the cohort based on the referenced version of the library. I think that this discussion should follow that same trail: That we have a single R library that is able to execute the study definition with no intermediate assets,

schuemie commented 6 years ago

Adding my two cents to the discussion of the packaging approach:

If I understand correctly, there are two options on the table:

The study specification (in JSON) is consumed by Hydra, which creates a study package implementing that study. To run the study, you need to install and run that study package, and all of its dependencies.
The study specification (in JSON) is consumed by a study execution package. This package reads the specifications, and runs the study accordingly.

I actually have some experience with option 2 in the past, with a tool I developed called Jerboa. My experience (which my not apply here) is that every study is different, and that these differences extend beyond what could be specified in Jerboa. Therefore, each new study not only came with a study specification, but also a new version of Jerboa. This obviously defeats the purpose of having a single study execution engine.

This is why in OHDSI we decided for R as the basis for our analytics. The OHDSI Methods Library is a set of R packages with functions that can be used to implement a full study with a few lines of R code, but you still have the full flexibility of R. If one of the provided functions doesn't do the trick, you can add your own.

Recognizing that writing R code might be difficult for some, we now also have functionality in ATLAS that helps write the R code for you. In the next iteration we're now discussing, I imagine the output of ATLAS to change from a single R script to a full R package, using option 1 described above. This output, in my view, is the basis of the real study package. Perhaps for 80% of the studies it is actually enough, certainly for 20% of the studies this basis will need to be extended and modified.

Let's take a concrete example: often a study requires covariates that are not implemented in FeatureExtraction. Luckily, we have the flexibility to generate custom covariates in R (e.g. this custom covariate builder), so we just add it to our study package.

I'm not sure how this would work under option 2. Having a single, monolithic, study execution engine means it will be very hard for people to modify it for the 20% of studies that don't fit the mold. And if they do modify it, we have the same problem I had with Jerboa, where each study comes with its own execution engine.

I'm therefore in favor of option 1. How we deal with dependencies of study packages is I think another discussion.

gklebanov commented 6 years ago

A lot of good discussions here, I will just comment on one for now:

The study code generation is an interesting idea for those folks who want to get their hands dirty with coding and customization (a.k.a. Model-Driven Architecture - MDA), I just do not know why we need to physically generate the same code within ATLAS instead of just bundling a pre-built, re-usable and versioned R component that just takes JSON as an argument. And I do not think we should be changing the signature of the main R component depending on arguments used - this is just another attribute in JSON spec, can address the requirements even for those folks that want to customize code.

So, I am also in favor of the pre-built and versioned "R package written to accept a study specification (as a JSON) and perform the execution". And we also need to allow advanced users to download the code to customize it, where required.

pavgra commented 6 years ago

@schuemie, could you point me why it is going to be hard to modify the 20% of studies, if we'd provide people a way to override default methods? The most popular programming paradigm, OOP (for the note - I love functional paradigm), is working with such approach and developers all across the world do not experience problems.