National-COVID-Cohort-Collaborative / Data-Ingestion-and-Harmonization

Data Ingestion and Harmonization
N3C Extensions to OMOP #43

Open DaveraGabriel opened 4 years ago

DaveraGabriel commented 4 years ago

The N3C project requirements and DI&H processes require data outside a standard OMOP implementation. Primarily, these are data created or used by N3C processes in addition to the source data. Additional database fields (columns) or tables planned for inclusion cover, but are not limited to, the following information:

1) Concept / code grouping parent identifiers, added as a usability enhancement to the data store. These are "roll-up" codes that supply parent (less specific) concepts, as coded in a CDM, to group data into substantially similar concept groups that are more amenable to researcher / end-user use. For example, many LOINC or NDC codes can be grouped into a more general parent concept with little impact on computation.

2) Text strings or other information in the source records that cannot be cleanly associated with a concept code in the OMOP vocabulary. An example appears in the CDM Data Acquisition process for [COVID_Elements-2020_05_03.docx]:

"Peripheral oxygen saturation (SpO2)/fraction of inspired oxygen (FiO2) [ratio], This ratio is currently not part of LOINC or SNOMED. If it is added to one of those terminologies in the future, this guidance will be updated on how to create an appropriate entry in OBS_CLIN."

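One way to picture item 2 is a small lookup that gives an unmapped source string a locally defined concept while keeping the original text. This is only a sketch under stated assumptions: the ID `2_000_000_001`, the `LOCAL_CONCEPTS` table, and the function names are hypothetical and are not part of any N3C or OMOP release. Using concept ID 0 for "no matching concept" follows the usual OMOP convention.

```python
from typing import NamedTuple

# OMOP convention: concept_id 0 means "no matching concept".
NO_MATCHING_CONCEPT = 0

# Hypothetical local concept IDs for strings that have no standard code.
# The ID below is a placeholder chosen for illustration only.
LOCAL_CONCEPTS = {
    "spo2/fio2 ratio": 2_000_000_001,
}


class MappedValue(NamedTuple):
    source_value: str  # original text, always preserved
    concept_id: int    # local concept ID, or 0 when nothing matches


def normalize(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def map_source_string(source_value: str) -> MappedValue:
    """Look up a source string, falling back to concept_id 0."""
    concept_id = LOCAL_CONCEPTS.get(normalize(source_value), NO_MATCHING_CONCEPT)
    return MappedValue(source_value, concept_id)


mapped = map_source_string("  SpO2/FiO2   Ratio ")
unmapped = map_source_string("free text with no local concept")
```

Keeping `source_value` beside the assigned ID means the mapping can be redone later if the term is added to LOINC or SNOMED.
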
DaveraGabriel commented 4 years ago

Agreed, that is a good example of harmonization. I could be convinced to consider a 4th column of HPO reductions. What I had in mind was intermediate, where we preserve the scalar value, rather than a binary variable, but simply put like tests in the same box. Chris

Hi Chris, as you know, this is the whole point of the loinc2hpo library (https://www.nature.com/articles/s41746-019-0110-4). We are now in the process of implementing a Python version that will run on the Palantir site. We could use some help with the planned interface between our system and the Palantir system and would appreciate help/advice. -Peter

Thoughtful comments, Andrew. I agree multidisciplinary teams are what we want to encourage.

I think there is a compromise here. For the most common and important clinical variables, maybe on the order of a few score, I favor creating and forwarding less granular mappings as a third set of columns. As a clinician, I know that values associated with different methodologies, say for serum sodium, are collapsed in graphs and reports in the EHR, since the different methodologies are entirely irrelevant to the clinical interpretation of the values. This is also true for many COVID critical variables such as blood creatinines.

Chris
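
A minimal sketch of the "third set of columns" idea: each result keeps its granular source code and its scalar value, and a less granular parent code is added alongside. All codes below are placeholders rather than real LOINC codes, and `ROLLUP`, `LabResult`, and the function names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

# Hypothetical child-to-parent roll-up table (placeholder codes, not real LOINC).
ROLLUP = {
    "CHILD-SODIUM-METHOD-A": "PARENT-SODIUM",
    "CHILD-SODIUM-METHOD-B": "PARENT-SODIUM",
}


@dataclass(frozen=True)
class LabResult:
    source_code: str                    # granular code as mapped from the source
    value: float                        # scalar value, kept unchanged
    unit: str
    parent_code: Optional[str] = None   # extra, less granular column


def with_parent(result: LabResult) -> LabResult:
    """Return a copy with the parent column filled in (None if ungrouped)."""
    return replace(result, parent_code=ROLLUP.get(result.source_code))


def values_for_parent(rows: List[LabResult], parent_code: str) -> List[float]:
    """Collect the scalar values of every result under one parent concept."""
    return [r.value for r in rows if r.parent_code == parent_code]


results = [
    LabResult("CHILD-SODIUM-METHOD-A", 138.0, "mmol/L"),
    LabResult("CHILD-SODIUM-METHOD-B", 141.0, "mmol/L"),
    LabResult("CHILD-UNGROUPED", 7.0, "mg/dL"),
]
rolled_up = [with_parent(r) for r in results]
sodium_values = values_for_parent(rolled_up, "PARENT-SODIUM")
```

Because `source_code` is never overwritten, anyone who needs the method-level distinction can still get it.
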

We concluded our meeting yesterday with a consensus in favor of an extra column holding a preferred mapping for concepts, which might be at a less granular level than the mapped standard concept used in the source data. The thought was that this might be another aid to the complex phenotyping work required for model development: a useful default.

I agreed and still think that this is likely to make it easier for researchers to do their work. And I agree that ease of use is important.

The consequences of making the easy thing to do the wrong thing to do are worth considering.

As this paper shows, a difference between the features used to develop a model and those used to externally validate it is an important source of bias that can seriously decrease a predictive model’s value. https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.8183

Their conclusion is one I agree with and one that clearly reinforces the value of the tool Andrew Girven showed: “prediction models should be derived from and validated on datasets collected with measurement procedures that are in widespread use in the intended clinical setting.”

In N3C’s case, developing a model that reflects measurement practices and procedure versions that are common at my site but uncommon at yours may be the “wrong” model development practice we want to discourage. An easy mapped-concept option that obscures the distinction between my site’s measurements and yours will discourage consideration of this potentially very important aspect of phenotype development.

I am not arguing against the extra column with the easy mapping option. But I think it will be good to push people to consider the consequences of using it, since they may be important.

I hope this type of communication is a useful form of dialog as we try to devise guardrails that strike the right balance between ease of use and best performance and most valid evidence.

I wonder if our thinking about this will best be framed by a research team with the required skills to do the research, rather than a hypothetical lone clinician researcher who would likely feel daunted at the task of learning and navigating vocabularies.

We wouldn’t encourage that researcher to do their study without appropriate guidance on statistical methods or, if it involved genomics, without appropriate guidance from someone with that expertise. So it stands to reason that an informatics person who can help guide the appropriate use of clinical data should similarly be expected.

Andrew

DaveraGabriel commented 4 years ago

From: jhu-informatics-team@googlegroups.com On Behalf Of Christopher Chute
Sent: Tuesday, June 23, 2020 11:49 AM
To: jhu-informatics-team@googlegroups.com
Subject: Case status

Our list of things to do just got an addition. I am writing out my recollection of the entire list, which could be added to GitHub. Case status is the new one.

  1. Aggregating like labs/meds into parent concepts
  2. LOINC to HPO coding
  3. Pre-computing categories of case status, based on Emily’s GitHub criteria (definite, possible, probable, etc.), since this requires real date information (things before or after March 1st) and is not possible for the Safe Harbor derivative.

Chris

kmkostka commented 4 years ago

  1. Generally speaking, Concept Sets are intended to be a way to group related terms into a unit that you can then use in logic. They are intended to be a functional equivalent to “clinical groupers” or “code groups” for use in analytics. Creating a grouping that looks at things like "Diabetes" or "HbA1c values above a specified threshold" is veering into rule-based cohort definitions. Both can be created as JSON objects and used in the Enclave. You could create derivations of these in analytical tables in Palantir. I would not store them in the CDM itself because they are intended more for a results schema (e.g. they can change over time based on what's in your CDM; they are not actually data elements of the CDM). I would be happy to train the Analytics team on how to create these artifacts and how to store them as metadata in your Enclave analytics workspace. It is not as complicated as it sounds and would create significantly more transparency into how we generated a pre-computed variable.
  2. @empfff may want to weigh in on the use of the labels proposed. I think these labels are not validated enough to be used at scale. It would be safer to let people build Cohort Definitions that can be shared in the Enclave to show transparency in how someone meets inclusion or exclusion in a specific label.
  3. Unmapped or missing concepts not currently supported in a standard (LOINC, SNOMED, etc.) can be custom mapped to the OMOP Vocabularies. We do this all the time with EHR data. The OMOP Vocabulary team would build an N3C Vocabulary that ascribes meaning to these unmapped strings and coordinates them to the relevant concepts. You can order these maps by submitting a use case and a list of unmapped values to the OMOP Vocabulary WG (Clair can help direct you to the right resources). It will be triaged and addressed.
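
To make point 1 concrete, here is a rough sketch of a concept set kept as a JSON-serializable object outside the CDM tables and applied at analysis time. The field names are a simplified, hypothetical shape and not the exact format used by ATLAS or the Enclave. The concept IDs are placeholders.

```python
import json
from typing import Dict, List, Set

# Hypothetical, simplified concept set; IDs and field names are illustrative only.
concept_set = {
    "name": "Blood creatinine (illustrative)",
    "items": [
        {"concept_id": 1001, "is_excluded": False},
        {"concept_id": 1002, "is_excluded": False},
        {"concept_id": 1003, "is_excluded": True},
    ],
}


def included_ids(cs: Dict) -> Set[int]:
    """Concept IDs that the set includes (excluded items are dropped)."""
    return {item["concept_id"] for item in cs["items"] if not item["is_excluded"]}


def select_rows(rows: List[Dict], cs: Dict) -> List[Dict]:
    """Keep only the rows whose concept_id belongs to the concept set."""
    ids = included_ids(cs)
    return [row for row in rows if row["concept_id"] in ids]


# Toy measurement rows; the CDM data itself is never modified.
rows = [
    {"person_id": 1, "concept_id": 1001, "value": 0.9},
    {"person_id": 2, "concept_id": 1003, "value": 1.4},
    {"person_id": 3, "concept_id": 2000, "value": 5.0},
]
selected = select_rows(rows, concept_set)

# The definition round-trips through JSON, so it can be shared as metadata.
serialized = json.dumps(concept_set, sort_keys=True)
restored = json.loads(serialized)
```

Since the grouping lives in the shared definition rather than in a stored column, changing the definition does not require reloading the data.
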

empfff commented 4 years ago

I guess I'll add that we never really obtained full community agreement on the categories in the phenotype--one woman's "suspected" is another woman's "possible," for example. I agree with Kristin's suggestion to represent these labels in shared cohort definitions rather than persisting them in the database.

cgchute commented 4 years ago

Fair enough on the case status categories, I had thought there was more consensus. I do believe that pre-computing some lab parents (all those blood creatinine orphans) would be useful, particularly for the elements used in the characterization paper.