OHDSI / WebAPI

OHDSI WebAPI contains all OHDSI services that can be called from OHDSI applications
Apache License 2.0
126 stars · 156 forks

New Cohort Characterization #495

Closed pavgra closed 5 years ago

pavgra commented 6 years ago

Today there are several mechanisms in Atlas that provide cohort characterization: Heracles (an adaptation of Achilles for cohorts), Feature Extraction ("Generate w/ Features"), and Cost & Utilization reports. All of them pursue the same goal, but they add complexity for both business users and developers because they live in different parts of the application and do their job in different ways. It is not obvious which one to choose, or which one to extend during development. The suggestion is to provide a single mechanism that supports current needs and is easily extendable both by business users (ad hoc) and by developers.

The core idea is to re-use the https://github.com/OHDSI/FeatureExtraction package by giving the user options to either select existing FE analyses, or create an analysis on the fly based on criteria expressions or raw SQL:

atlas cohort characterization use cases

The first step is to provide a library of FE analyses and the ability to create new ones:

atlas cohort characterization fe analysis sequence

When analyses are available, a user can create a Cohort Characterization report by picking the cohorts to analyze, selecting a set of analyses, and configuring the analysis parameters:

atlas cohort characterization cc sequence

The database for such a solution could look like: atlas cohort characterization database

DAO layer:

atlas cohort characterization data classes

Service layer:

atlas cohort characterization service classes

The original discussion of the solution was held at Janssen R&D. The following contributors participated:

@pbr6cornell, @gowthamrao, Dave Kern, @chrisknoll, @pavgra.

gowthamrao commented 6 years ago

Excellent work @pavgra, this is in line with the decision to remove and refactor the three cohort characterization mechanisms.

Diagram 1: User options

  1. Can we clarify some concepts, e.g. 'cohort characterization list', 'feature analyses list', 'cohort characterization object', and 'analysis domain'? Defining these would make discussions easier.

Diagram 2: Library of FE analyses

  1. We should probably describe that we want to introduce two new subtypes of methods for computing cohort-level features: criteria expressions and raw SQL. The current functionality of FE will be called 'regular'. Maybe we could use 'concept_id as feature' to describe regular, 'criteria expression as a feature' for criteria expressions, and 'custom query as feature' for raw SQL? That may make it less ambiguous.
  2. Analysis design: do we need some form of validation check? I.e., does the raw query work? E.g., we could execute the query with cohort_definition_id = -999 to do a syntax check.
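The -999 dry-run idea could be sketched roughly as follows. This is a hypothetical check, not WebAPI's implementation: the table and `@cohort_id` parameter names are illustrative, and an in-memory SQLite database stands in for the actual CDM/results schema.

```python
import sqlite3

def validate_feature_sql(sql: str) -> bool:
    """Dry-run a user-supplied feature query against an empty schema,
    substituting a cohort_definition_id (-999) that matches no rows,
    so the statement is parsed and executed but returns nothing."""
    conn = sqlite3.connect(":memory:")
    try:
        # Minimal stand-in schema; a real check would target the CDM.
        conn.execute(
            "CREATE TABLE cohort (cohort_definition_id INT, subject_id INT)")
        conn.execute(sql.replace("@cohort_id", "-999"))
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```

A malformed query (or one referencing a missing table) fails at execution time, which is exactly the syntax check being proposed.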

Diagram 3: Cohort characterization report

  1. I think there has to be some form of default cohort characterization. This is probably the current 'concept_id as feature' functionality.

Diagram 4: Database solution

  1. One cohort (cohort.id) has many cohort characterization IDs (cohort_characterizations_cohorts.cohort_characterization_id). In this design, cohort_characterizations seems to retain only one (current) version of a cohort characterization; there is no versioning of cohort_characterization. Do we need to create versions or retain history?
  2. fe_analyses - is this only for 'regular', i.e. 'concept_id as feature'? I don't think it can support the other two - 'criteria expression as a feature' and 'custom query as feature'?
  3. Could we add last_run_by: integer to cohort_characterization_generations, to track who ran the cohort characterization generation?
  4. cohort_characterization_results: is time_window meant to stratify by intervals of time such as week, month, quarter, or year?

Diagram 5: DAO layer

  1. Just a reminder that we are supporting multiple DBMS instances, e.g. Redshift and Postgres, so we have to make sure the design supports those.

pavgra commented 6 years ago

@gowthamrao, I see the concepts in the following way:

Analysis domain

Initially, the distinct list of values from the analysisId column in https://github.com/OHDSI/FeatureExtraction/blob/master/inst/csv/PrespecTemporalAnalyses.csv

Feature Analysis

An entity representing an analysis to be run using the Feature Extraction package. Besides descriptive fields, it contains a type and a design. Three design types are going to be available, and for each of them the value field is enough to store the design:

I don't think it can support the other two - 'criteria expression as a feature' and 'custom query as feature'?

You can see above that the suggested structure is sufficient and flexible enough to contain all analysis types, and the type of a specific entry is defined by the type field.
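The single-entity structure described above could be illustrated like this. Field and type names here are assumptions for illustration, not the actual WebAPI schema:

```python
from dataclasses import dataclass

# Hypothetical discriminator values for the three proposed design types.
PRESET, CRITERIA_SET, CUSTOM_FE = "PRESET", "CRITERIA_SET", "CUSTOM_FE"

@dataclass
class FeatureAnalysis:
    name: str
    descr: str
    type: str   # one of PRESET | CRITERIA_SET | CUSTOM_FE
    value: str  # FE analysis name, serialized criteria JSON, or raw SQL

# One table/entity holds all three kinds; `type` tells consumers how to
# interpret `value`.
analyses = [
    FeatureAnalysis("Age group", "Demographics", PRESET, "DemographicsAgeGroup"),
    FeatureAnalysis("Prior MI", "Criteria-based", CRITERIA_SET, '{"criteriaList": []}'),
    FeatureAnalysis("Custom feature", "Raw SQL", CUSTOM_FE, "SELECT 1"),
]
```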

Analysis design -do we need some form of validation check?

I would not include analysis design validation in the first version of Cohort Characterization, because it will already require a lot of effort to develop.

I suggest initially populating the Feature analysis list with all of the FE package's regular analyses.

Cohort Characterization

An entity defining which Feature analyses to execute over a (set of) cohort(s). It also contains values to set up FE package variables.

There is no versioning of cohort_characterization. Do we need to create versions or retain history?

Again, as we discussed in person, I'd treat it as a separate task. A history of executions should be enough for now.

Could we add to cohort_characterization_generations last_run_by: integer to track who ran the cohort characterization generation?

Reasonable, agree.

Just a reminder that we are supporting multiple dbms instances e.g. redshift and postgres. So we have to make sure the design supports those

This is achieved by using the FE package and templated OHDSI SQL for custom analyses.
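In the real stack, OHDSI SQL templating and dialect translation are done by the SqlRender package; as a toy illustration of the `@parameter` substitution style only (function name and queries are hypothetical):

```python
def render(sql: str, **params) -> str:
    """Naive @parameter substitution in the OHDSI SQL template style.
    (SqlRender additionally translates SQL across dialects, e.g. to
    Redshift or Postgres; that part is not sketched here.)"""
    # Replace longer names first so "@results_schema" is not clobbered
    # by a shorter parameter such as "@results".
    for name in sorted(params, key=len, reverse=True):
        sql = sql.replace("@" + name, str(params[name]))
    return sql

rendered = render(
    "SELECT subject_id FROM @results_schema.cohort WHERE cohort_definition_id = @cohort_id",
    results_schema="results", cohort_id=42)
```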

pavgra commented 6 years ago

The following outcomes are from a discussion with @chrisknoll:

Decisions:

Open questions:

chrisknoll commented 6 years ago

Hi everyone, I think after our discussion we have some baseline functionality nailed down that we can begin implementing.

@pavgra : there was one thing that Patrick mentioned back at the F2F: the idea around 'criteria sets' which can possibly be referred to as 'feature sets' in characterization: it's a way of grouping criteria expressions together so that when you bring in a criteria-set/feature-set of 'FDA Specified Age Groups', it returns the set of criteria that buckets people into the FDA-specified age group buckets (like a 0-14, 15-18, 19-21, 22-36, 37-64, 65+ set of criteria). The notion of feature-sets is not defined in the object model anywhere, so if we want it, we'll have to figure out how to capture it.

pavgra commented 6 years ago

@chrisknoll, I assume there is no need for "feature sets" - we can just add a separate type of the Feature entity, like "Criteria set". Or we could even make "Criteria set" the one and only criteria-based feature type, instead of having separate "Criteria" and "Criteria set" types. Continuing the topic, I would suggest creating a separate table for the criteria so that we can easily introduce a criteria library / suggestions later.

gowthamrao commented 6 years ago

@pavgra thank you, so it sounds like:

I think you mean that when fe_analyses.type = regular, then fe_analyses.domain_id = @domain_id, fe_analyses.value = @analysisName, and fe_analyses.descr = @description, where @domain_id, @analysisName, and @description come from PrespecTemporalAnalyses.csv - could you please confirm?

And when fe_analyses.type != regular, i.e. it is of type 'criteria expression as feature' or 'custom query as feature', the information about the expression or the query is in the value field?

pavgra commented 6 years ago

@gowthamrao, yes, this is what I suggested. Do you see any problems with the approach?

gowthamrao commented 6 years ago

@pavgra, that segment of the solution sounds fine to me. Not having versioning for now is also fine for a first implementation.

Agreed on maintaining a library of Feature analyses

what does that mean?

Regarding cohort_characterization_results:

pavgra commented 6 years ago

@gowthamrao ,

what does that mean?

This means that there will be a list of Feature analyses (similar to Concept Sets / Cohorts / etc.).

cohort_characterization_results.covariate_id and cohort_characterization_results.covariate_name is it better to have them in two different tables?

Maybe; let me do DB normalization after the general logic is settled.

cohort_characterization_results.time_window

I'd ask @chrisknoll for expert opinion here

Why do we need this in cohort_characterization_results when it is in fe_analyses?

The analysis_id is meant to reference feature_analyses.id. I'll add the missing foreign key there.

pavgra commented 6 years ago

Cleaned up and updated the diagrams:

atlas cohort characterization class diagram atlas cohort characterization database

@chrisknoll , @gowthamrao

pavgra commented 6 years ago

Updated use-cases diagram with some new cases and better visual grouping:

atlas cohort characterization use cases

pavgra commented 6 years ago

@pbr6cornell, @gklebanov, @chrisknoll, we've already started this discussion a couple of times, but I guess it is a good time not only to continue the talk but to move further with actual steps. I'm speaking about standardizing analysis designs and the ability to share specifications between applications, and the cohort characterization analysis seems to be a good candidate for this. My suggestion: similarly to the Cohort definition, create a package inside circe-be (org.ohdsi.circe.cohortcharacterization) to hold the interfaces defining Cohort Characterization (interfaces are a better way to define standardized parts, and they can still be properly annotated for serialization), and inherit them in the CC design classes used in WebAPI. Later, the interfaces could be re-used by other apps, which would foster real collaboration not only at the data level (CommonDataModel) but in app development too. Please have a look at a possible implementation of the idea described above:

atlas cohort characterization class diagram

pavgra commented 6 years ago

Some initial mocks

Cohort Characterizations section

characterizations_list

Cohort Characterization view-edit page

view_edit_feature_analysis

pavgra commented 6 years ago

Note regarding the Cohort Characterization interfaces package: all methods should be annotated with the proper serialization (Jackson) annotations, so that it is possible to serialize the interfaces and get a standard analysis JSON definition.
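The serialization goal can be illustrated outside of Java/Jackson. A rough Python analogue of turning a design object into a standard analysis JSON definition (the class and field names are hypothetical, not the actual interfaces):

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical analogue of the proposed CC design interfaces: the point of
# the Jackson annotations is that any implementing class serializes to the
# same standard, shareable analysis JSON.
@dataclass
class CohortCharacterizationDesign:
    name: str
    cohortIds: list = field(default_factory=list)
    featureAnalysisIds: list = field(default_factory=list)

design = CohortCharacterizationDesign("Diabetes CC", [101, 102], [1, 2, 3])
design_json = json.dumps(asdict(design), sort_keys=True)
```

Any application that understands the standard JSON shape could then import the design, which is the collaboration point being proposed.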

Also, a couple of questions were raised during a discussion with @chrisknoll and @gklebanov:

gowthamrao commented 6 years ago

Is there a need to include cohort expressions into CC analysis definition?

From a UX standpoint: we decided to separate the UI for building the cohort from the UI for cohort characterization (CC), as in @pavgra's mock-up. This UX is similar to incidence rate analysis - i.e. we refer to cohort expressions by cohort_definition_id.

Regarding the two options - embedding versus separating cohort expressions and CC expressions - both have pros and cons. It's the same concern as with concept-set expressions: while we like the idea of an independent library of concept-set expressions, we embed the concept-set expression into the cohort expression so that the cohort expression is independently exportable as an encapsulated object.

If we update the concept-set expression library, we then need to go to each cohort expression and update the embedded concept-set expression, which makes maintenance a nightmare. At the same time, if we don't embed, we can't ensure reproducibility.

I like the embedded and encapsulated approach - but I would also like functionality that does some form of comparison between embedded and library expressions, so that the user may refresh/update the embedded copy with the version in the library in an easy-to-use way. If we can make it easier to maintain the embedded expressions, through functionality that leverages the expression library (like the concept-set expression library), that would be best, IMO.

davekern commented 5 years ago

I like the embedded and encapsulated approach - but I would also like functionality that does some form of comparison between embedded and library expressions, so that the user may refresh/update the embedded copy with the version in the library in an easy-to-use way. If we can make it easier to maintain the embedded expressions, through functionality that leverages the expression library (like the concept-set expression library), that would be best, IMO.

Agree with this. I'd like to see it function similarly to the cohort/concept-set relationship that @gowthamrao is talking about. And it would be great if the tool could check whether the embedded cohort definition is current with the library version, and have the ability to refresh as necessary. @chrisknoll, this is similar to the discussions we were having at Janssen about what we'd like the cohort creation tool to do, where it performs certain checks automatically and creates a log of that information (e.g., redundant or unused concept sets, logic errors, etc.).

davekern commented 5 years ago

We need a place to put the CC interfaces, and circe is not the best one. So the idea is to create a separate multi-project repo where we could put both circe- and CC-related code, and later continue using it for analysis-design-related things.

@pavgra I like this idea, because the final phase of this work is going to be full-blown analyses of multiple outcomes between the two [matched] cohorts, which may be included as an added feature of this cohort characterization, or may be something separate. If we have a repo where these interface designs are stored and can be reused, it will be useful when we get there.

t-abdul-basser commented 5 years ago

@pavgra I approve of the idea of standardizing analysis designs and facilitating sharing of standardized designs by using Jackson annotated interfaces.

pavgra commented 5 years ago

@pbr6cornell, @gowthamrao, @chrisknoll, @davekern, since we do not have any significant disagreement on the CC design, Odysseus is going to start work on the feature (actually, UI is already being coded). We will post updates and screenshots during the development process.

anthonysena commented 5 years ago

@pavgra this is a great thread - thank you for your work in documenting the design. As I was reviewing this issue, I wanted to further surface the point of embedding a cohort definition in the cohort characterization versus referencing the definition. This is something that will be needed in several areas of Atlas (IR, estimation, PLP, etc.), so I would like to make sure I'm clear on how it will be approached here, with the hope that we can have a consistent approach across the platform.

From my understanding of the ER diagram, it looks like cohort characterization will reference the cohort definitions that are utilized in the analysis (table: cohort_characterization_cohorts). This is similar to what we've done to date for estimation and PLP. One challenge with this approach comes when a user removes a cohort definition from the system - we're left with an orphan record.

One thought I've had to address that challenge: store a copy of the cohort definition in the cohort_characterization_cohorts table when saving. The idea is that when a cohort is added to the analysis, the definition is copied. Subsequent updates skip the record, preserving the definition at the time of import. This enables the potential to compare a cohort copy to its originating definition in the repository (or re-establish the cohort if it is removed). Furthermore, it could be useful when importing an analysis definition from another environment, whereby the analysis contains an embedded cohort definition and the import function creates the cohort definition and its linkage to the new analysis.
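The copy-on-save behavior being proposed could look roughly like this. This is a hypothetical sketch in which plain dicts stand in for the actual tables:

```python
def attach_cohort(analysis: dict, cohort: dict) -> None:
    """Snapshot a cohort definition the first time it is attached to an
    analysis; subsequent saves keep the snapshot, so the analysis can be
    compared against (or survive deletion of) the library cohort."""
    snapshots = analysis.setdefault("cohort_snapshots", {})
    if cohort["id"] not in snapshots:           # copy only on first attach
        snapshots[cohort["id"]] = dict(cohort)  # shallow snapshot

cc = {"name": "My CC"}
attach_cohort(cc, {"id": 7, "expression": "v1"})
attach_cohort(cc, {"id": 7, "expression": "v2"})  # skipped: snapshot kept
```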

This may be 'out of scope' for cohort characterization but thought it would be useful to raise awareness and potentially think about how this would be stored, even if we decide to hold off on this until a later date.

pavgra commented 5 years ago

@anthonysena, my vision of the reference-vs-copy topic is the following. First, referencing is the proper DB design from a normalization perspective. What is more, referencing is required anyway, because you may want to see whether referenced entities have changed since the latest update of the current entity (a Cohort Characterization in our case). Second, copying is needed from a versioning perspective, so basically there is a need for a versioning mechanism. These are two different things, each of which should have its own solution, and I'd not mix them. A universal versioning mechanism should be discussed, designed, and applied on top of a properly designed relational database.

anthonysena commented 5 years ago

@pavgra thanks for your perspective here. I'm all for referencing entities and that's the current approach for estimation & PLP. I also agree that we need a holistic approach to versioning. So, I'm fine to table the idea I proposed until a later time when we can discuss versioning.

For cohort characterization, have you thought about how an analysis will be represented so that it can be transported across environments? This is currently a challenge for estimation & PLP, in part because we do not embed the cohort definitions in the exported definition. I'm thinking that estimation/PLP analyses will require REST endpoints that can export and import the full specification (and create the cohort references as part of the import operation). Do you have a similar requirement for cohort characterization?

pavgra commented 5 years ago

@anthonysena, that's a tricky question. On one side, it would be great to let a user copy-paste a CC entity, so there is a need to embed cohorts; on the other side, the CC itself is the analysis and cohorts are just its input, so they shouldn't be included in the design. So my latest thought on this: let's add a switch to the Export page (i.e. a flag on the REST endpoint) which dictates whether to embed or exclude cohort definitions from the exported CC definition.
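The switch being proposed could be sketched like this — a hypothetical shape for the exported payload, not the actual REST contract:

```python
def export_cc(cc: dict, cohort_repo: dict, embed_cohorts: bool = False) -> dict:
    """Export a CC definition, optionally embedding the full cohort
    definitions (for cross-environment transport) instead of only
    referencing them by id."""
    out = {"name": cc["name"], "cohortIds": list(cc["cohortIds"])}
    if embed_cohorts:
        out["cohorts"] = [cohort_repo[cid] for cid in cc["cohortIds"]]
    return out

repo = {1: {"id": 1, "expression": "{}"}}
by_ref = export_cc({"name": "CC", "cohortIds": [1]}, repo)
embedded = export_cc({"name": "CC", "cohortIds": [1]}, repo, embed_cohorts=True)
```

On import into another environment, the embedded variant would carry enough information to recreate the cohorts and their references.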

gowthamrao commented 5 years ago

let's add a switch to the Export page (i.e. a flag on the REST endpoint) which dictates whether to embed or exclude cohort definitions from the exported CC definition

My understanding is that if we enable this switch in Export, then we could import the result into Atlas B, and Atlas B would then have a fully specified definition and could generate and characterize the cohort (which is awesome).

Then, I think, there will not be a cohort_definition_id and a record in the cohort UI of Atlas B?

pavgra commented 5 years ago

@gowthamrao, since we agreed that CC doesn't store copies of cohorts, there will be records in the cohort browser of Atlas B.

pavgra commented 5 years ago

Some screenshots of the upcoming Cohort Characterizations (markup only as yet). Feedback is appreciated.

atlas - characterizations list

atlas - feature analyses list

atlas - characterizations view edit - design

atlas - characterizations view edit - executions collapsed

atlas - characterizations view edit - executions expanded

atlas - characterizations view edit - utils - json

gowthamrao commented 5 years ago

@pavgra, shouldn't we have two sections in Atlas?

  1. Covariate Definition
  2. Cohort Characterization/Design

Covariate Definition: this would be the covariate definition section (similar to concept sets), i.e. these two: image image

Cohort Characterization: this would be where we select a cohort and pull down its definition, plus select one or more covariate definitions from the Covariate Definition section to characterize the cohort: image

In Covariate Definition, the user will be able to perform the following functions (from your functional diagram above):

In Cohort Characterization, the user has the following functions:

(OK - maybe I am asking for too much.)

gowthamrao commented 5 years ago

Note: we use covariate/feature interchangeably here. I propose that we use the term 'feature' for FeatureExtraction-style 'concept_id'-level attributes, and 'covariate' for complex expression-based attributes, e.g.:

Custom SQL is a covariate.

chrisknoll commented 5 years ago

Note: we use covariate/feature interchangeably here. I propose that we use the term 'feature' for FeatureExtraction-style 'concept_id'-level attributes, and 'covariate' for complex expression-based attributes, e.g.:

I don't think we should separate those terms like that. A covariate could come from a simple 'existence of concept ID within past 365d', from 'execute custom SQL to establish existence of something in past 365d', or from 'is anything present between 365d before and 0d before index' via a criteria expression. They state the same thing, just yielded in different ways. Covariates are, I believe quite literally, variables used together in a model; 'feature' is just a general term for a population characteristic. Features are used as covariates, so that's the connection. How they are defined or calculated is secondary to their use, or 'what they are'.

We do have pre-packaged features (the defaults from FE2), criteria-based features (driven by CIRCE-based criteria expressions), and custom SQL features (write any SQL you want to indicate that a person has presence of, or a value of, a feature). If we want to divide up features that way for Cohort Characterization, that makes sense to me.

pavgra commented 5 years ago

@gowthamrao, speaking of two separate sections: I'd not overcrowd the sidebar with a lot of elements until there is a clear need for it. Features are not used anywhere outside of CC at this point, so I'm not sure it's reasonable to put them in a separate section.

Regarding CC tabs: I've split things the way the workflow goes. First, design (selection of cohorts and features, parametrization); then, execution (results are the outcome of an execution, so the results page is going to be nested under the executions tab, similarly to Generate w/ Features, PLE, and PLP - I just haven't marked it up yet); and last, utils with export and import.

One more clarification: I haven't yet worked on the Feature design page (similar to the CC view/edit page), which is going to appear when you click a link in the Features table.

gowthamrao commented 5 years ago

Please see an initial attempt at text describing the functionality of the various sections of the new cohort characterization here. Once we finalize it, we could put it on the mock-up to replace the 'Lorem ipsum' dummy text.

jhardin29 commented 5 years ago

Hi everyone, this is really great; I'm looking forward to this functionality. @fdefalco, thank you for raising this thread for us to review.

I have a few specific questions/comments about how the new characterization features will function.

  1. It looks like one feature will be to allow the user to specify particular domains and time windows for characterization, and there will be some pre-specified features? Will the user be able to specify whether they want counts of people vs. counts of occurrences?

  2. Can an option be available, when the user selects a given cohort, to run the characterization on either the index or the matched cohort? I suspect not, but I thought I would ask. If this were possible, it would save us from making two copies of a cohort - one at the index level and one after applying inclusion/exclusion rules - and thus avoid cluttering up the cohort section of ATLAS.

  3. I see there is functionality planned to export the characterization results, which is great, as I anticipate using this aspect.

  4. Can we run the characterization on a given cohort for different data sources, and then have a utility to compare the results between data sources and export each result and the comparison? Maybe this is asking too much - it's more along the lines of a cohort comparison feature...

Can't wait to try this new feature!

pavgra commented 5 years ago

Results page markup. A couple of cohorts, several reports:

atlas - characterizations view edit - results - all tables

Filter usage:

atlas - characterizations view edit - results - filtered tables

Markup for the 'show as charts' mode is coming.

chrisknoll commented 5 years ago

Functionality comment: it might be useful to show prevalence (binary) covariates side by side. In the tabular report, if the selected cohorts could be shown side by side (count and %) based on the user's selection of cohorts of interest, the report won't feel so 'long' when there are more than 3 cohorts in the results. It might be necessary to give the user a choice (another function) to show the results as comparative or concatenated.

pavgra commented 5 years ago

@chrisknoll, thank you for the suggestions, I'll work on it!

davekern commented 5 years ago

Just getting caught up on this after being out of the office for the last few days. @pavgra, I agree with @chrisknoll's comment about having cohorts side by side to help readability. One more suggestion: to reduce the number of filters, could we have just a handful: one for all demographics (rather than separately having to select age and gender), one for all conditions (e.g., Charlson score and individual comorbidities), one for medication use, and one for utilization (e.g., presence of hospitalizations, outpatient visits, ER visits)? Then, when we get to the point of incorporating continuous data (e.g., number of visits, costs, number of medication fills, etc.), we can just add it to those subreports rather than having even more filters.

davekern commented 5 years ago

Please see an initial attempt at text describing the functionality of the various sections of the new cohort characterization here. Once we finalize it, we could put it on the mock-up to replace the 'Lorem ipsum' dummy text.

@gowthamrao I made some minor edits and left one comment in there about the cohort inclusion. I think that can be deleted, since it will depend on how the user defines their cohort, which is separate from the cohort characterization.

pavgra commented 5 years ago

@chrisknoll, what do you think of this?

cc-display-modes

chrisknoll commented 5 years ago

Looks good, but we have some comments from our internal channels:

davekern commented 5 years ago

To add to @chrisknoll's comment: if the user chooses exactly 2 cohorts, it should also calculate standardized differences for each of the characteristics... in the side-by-side view this would just be another column on the far right.
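For reference, one common formulation of the standardized difference for a binary (prevalence) covariate between two cohorts — there are several variants in the literature — is the difference in proportions over the pooled standard deviation. A minimal sketch:

```python
import math

def std_diff(p1: float, p2: float) -> float:
    """Standardized difference for a binary covariate with prevalence
    p1 in cohort 1 and p2 in cohort 2 (assumes 0 < p < 1 in at least
    one cohort so the pooled variance is non-zero)."""
    pooled_var = (p1 * (1 - p1) + p2 * (1 - p2)) / 2
    return (p1 - p2) / math.sqrt(pooled_var)
```

A value around 0.1 or more is often read as a meaningful imbalance between the cohorts, which is what the extra column would surface.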

pavgra commented 5 years ago

As discussed, check out the scatterplot and std. diff calculation for cohort comparison:

image

And simple tables when single cohort selected:

image

davekern commented 5 years ago

The new updates look good. The presentation of the two cohorts side by side is great. How does that handle continuous data, like the Charlson score you have in the one-cohort view? Can we get a line running at 45 degrees (i.e., x = y) through the scatter plot?

pavgra commented 5 years ago

@davekern, I've updated the scatter plot based on your comments:

image

pavgra commented 5 years ago

The following are action items out of a call with @cgreich, @chrisknoll, @gowthamrao, @davekern, and @anthonysena:

gowthamrao commented 5 years ago

Please add export options (CSV, PNG).

pavgra commented 5 years ago

Updated the UI based on the comments above (except for the reorder button and export):

atlas - characterizations view edit - results 3

davekern commented 5 years ago

Looks really good. But if we have a lot of continuous variables, that's not going to scale well. Can we turn the box plot on its side and squeeze it down a bunch? We'll also need to label the box plots so we know what each one is for (e.g., Charlson score).

chrisknoll commented 5 years ago

I'll have to work on an option for the boxplot to support a 'horizontal' orientation.

pavgra commented 5 years ago

@davekern, why wouldn't it? I doubt that we'd gain any height by switching from 2 columns to 2 rows for the distribution stats. Also, the name is there, above the table - or was something else meant?

pavgra commented 5 years ago

Interfaces for the Cohort Characterization design and result classes were pushed to: https://github.com/OHDSI/StandardizedAnalysisAPI