biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

Map EFO terms to Biolink Model #493

Closed deepakunni3 closed 3 years ago

deepakunni3 commented 3 years ago

Is your feature request related to a problem? Please describe.

Map EFO terms to Biolink Model.

For example,

What working group (or team) did this request originate from?

Genetics Provider/Molecular Data Provider

Describe the solution you'd like

Add mappings to EFO, while adding new grouping classes where applicable.

Tag relevant members for discussion

@RichardBruskiewich and Marcin von Grotthuss

RichardBruskiewich commented 3 years ago

The ontology browser for EFO is at https://www.ebi.ac.uk/ols/ontologies/efo. We can leverage the hierarchy to start mapping terms into the Biolink Model.

RichardBruskiewich commented 3 years ago

First, note that this issue/PR relates somewhat to the general issue of modelling clinical data with cohorts, outcomes and exposures. However, the focus in this issue is more on genetic associations: genetic variants associated with "something".

In assessing the overall use cases, we note that there are three levels of data/information/knowledge to consider here:

  1. disease (diagnoses) and/or clinical measurements alongside genotyping data representing "raw" clinical data in the original patient profiles of individuals.

  2. aggregation and indexing of such clinical data from the first level into a dataset representing at least one cohort, then processing such data using statistical algorithms such as "MAGMA" (and other GWAS algorithms) to assign a statistical score for the association of specific (DNA sequence) variants with specific diagnostic ("disease") status or specific qualitative or quantitative values (or value ranges) of clinical measurements.

  3. selection of a specific score threshold over the datasets generated in level two, in order to make "variant-to-trait" knowledge assertions where the variant is the subject concept, the trait is the object concept, and the 'predicate' is a genetic association with a qualifier score of 'confidence' in the assertion, and a 'context' of the GWAS study (and underlying 'cohort study' or 'studies') which evidentially supports the assertion.
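
The step from level 2 to level 3 can be sketched roughly as follows. This is only an illustration, not the GKP's actual pipeline: the record types, the field names, the `genetic_association` predicate label and the placeholder identifiers are all hypothetical, and it assumes the GWAS score is a p-value filtered at the conventional genome-wide significance level of 5e-8.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GwasRow:
    """One level 2 record: a scored variant/trait pair (hypothetical shape)."""
    variant_id: str
    trait_id: str
    p_value: float   # assumption: the GWAS score is a p-value
    study_id: str    # the GWAS study (and hence cohort) behind the score


@dataclass(frozen=True)
class Assertion:
    """One level 3 knowledge assertion (knowledge graph edge)."""
    subject_id: str  # the variant
    predicate: str   # placeholder label, not a confirmed Biolink predicate
    object_id: str   # the disease or trait
    score: float     # confidence qualifier
    context: str     # provenance: the supporting study


def to_assertions(rows, max_p_value=5e-8):
    """Keep only rows at or below the threshold and turn them into edges."""
    return [
        Assertion(r.variant_id, "genetic_association", r.trait_id,
                  r.p_value, r.study_id)
        for r in rows
        if r.p_value <= max_p_value
    ]


# Placeholder identifiers, not real variants, EFO terms or studies.
rows = [
    GwasRow("EXAMPLE:variant_1", "EFO:TRAIT_A", 1e-9, "EXAMPLE:study_1"),
    GwasRow("EXAMPLE:variant_2", "EFO:TRAIT_A", 0.03, "EXAMPLE:study_1"),
]
assertions = to_assertions(rows)
```

With these inputs only the first row passes the threshold, so a single edge is produced, carrying its score and its study as context.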

During discussions with Marcin, it became clear that "Level 1" is buried deep inside the original "knowledge sources" of the Genetics Knowledge Provider and will not be exposed to Translator. Translator will only see Level 2 and Level 3 information. More specifically, level 2 information from the Translator Genetics Knowledge Provider (GKP) typically consists of GWAS-derived datasets generated by MAGMA.

The remainder of this discussion here assumes that perspective.

Upon review of the list of EFO terms submitted by Marcin von Grotthuss as terms used in "knowledge" being returned by the GKP, we discerned that some of the terms were instances of "diseases" (or disease-like) while other terms were "clinical measurements".

The "disease" EFO terms are effectively boolean "presence/absence" characteristic trait values of the cohort(s) whose disease status and genotypes were inputs to GWAS analysis.

The "clinical measurement" EFO terms do not refer to raw clinical values; rather, they represent some quantitative or qualitative value (or range of values) of a clinical measurement, treated as biological 'trait' values which, alongside cohort genotypes, were also submitted to GWAS analysis.

The output of the GWAS analyses in both cases is a set of "variant to trait" processed GWAS genetic associations, where the specific 'subject concept' genetic variant is associated, with some "statistical" level of confidence or "likelihood", with a specific 'object concept' that represents a given disease status or biological 'trait' value.

In both cases, there is an implied (level 2 information) cohort lurking as the source of the value and hence as the 'context' of validity of the value. In that sense, such a cohort (or rather, its study?) is the source of evidence/provenance lurking behind the genetic association, which itself is exported by the KP as a (level 3) knowledge assertion (knowledge graph edge). Note that a biolink:Cohort concept category is likely to be added to the Biolink Model soon by the related PR https://github.com/biolink/biolink-model/pull/494.

In practical terms, each 'object concept' in the resulting Biolink Model compliant knowledge assertion is either a specific biolink:Disease category node (the specific disease identified by the EFO 'disease' term) or a 'trait value' category node. However, we do not (yet) have a category called 'trait value', so the current decision is to use suitably qualified instances of the concept category biolink:PhenotypicFeature as the proxy concept class for 'biological trait'.
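
The category decision described above amounts to a small lookup. A minimal sketch, assuming each EFO term has already been labelled as a disease or a clinical measurement; the `term_kind` labels and the function name are hypothetical, and how a term gets its label is not covered here:

```python
# Hypothetical labels for the two kinds of EFO term discussed in this issue.
CATEGORY_BY_TERM_KIND = {
    "disease": "biolink:Disease",
    # No 'trait value' category exists yet, so PhenotypicFeature is the proxy.
    "clinical_measurement": "biolink:PhenotypicFeature",
}


def object_category(term_kind: str) -> str:
    """Return the Biolink category to use for an object concept node."""
    try:
        return CATEGORY_BY_TERM_KIND[term_kind]
    except KeyError:
        raise ValueError(f"unrecognised EFO term kind: {term_kind!r}") from None
```

Unrecognised kinds raise an error rather than silently defaulting to one of the two categories.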

By 'suitably qualified' we mean that, in addition to using the specific EFO term associated with the given clinical measurement of the genetic association, we may occasionally also need to give the trait a has attribute measurement that constrains the concept node to some particular quantitative or qualitative clinical measurement value. This is not always necessary: GWAS analysis doesn't care about specific values of the measurements, just that "some variation" in the expression of the genome near the SNP marker correlates with the values of the measurements. That is, it says nothing about the absolute size or direction of the measurement relative to "random" values of the variable, only that the variation in the measurement correlates with the presence or absence of the SNP, whether or not the SNP has a direct "functional" (coding or regulatory) impact on the expression of the genome.

The association itself will document the GWAS score of the association, as a measure of confidence of the genetic association itself.

We will strive to clean up the Biolink Model Pull Request 497 to simply reflect and support the above understanding of the Genetics Knowledge Provider use cases using EFO.

Note that although the PR will include a Biolink Model patch to support the initial comment by @deepakunni3 above, namely

Map EFO term for measurement by creating biolink:Measurement

and

Map EFO term for phenotype with biolink:PhenotypicFeature,

in fact, the modifications required to support the above Genetics Provider use case do not rely on these specific changes to the Biolink Model. Rather, the use cases rely specifically on the assignment of EFO as an id_prefixes namespace for the biolink:Disease and biolink:PhenotypicFeature category classes, plus clarification of the appropriate association class, including constraints on object concept nodes of category biolink:PhenotypicFeature that qualify trait values.
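
To illustrate what the id_prefixes assignment buys a consumer of the model, here is a minimal sketch of a prefix check. It is not the Biolink Model toolkit API: the dictionary and function are hypothetical, only the EFO entry proposed in this issue is listed (the prefixes those classes already accept are omitted), and the identifiers are placeholders rather than real terms.

```python
# Hypothetical extract of the model's id_prefixes, showing only the EFO
# namespace discussed here.
ID_PREFIXES = {
    "biolink:Disease": ["EFO"],
    "biolink:PhenotypicFeature": ["EFO"],
}


def prefix_allowed(curie: str, category: str, id_prefixes=ID_PREFIXES) -> bool:
    """Check that a CURIE's namespace prefix is listed for the category."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        return False  # not of the form PREFIX:LOCAL_ID
    return prefix in id_prefixes.get(category, [])
```

For example, `prefix_allowed("EFO:TRAIT_A", "biolink:PhenotypicFeature")` holds under this table, while an identifier with an unlisted prefix, or a category missing from the table, is rejected.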

RichardBruskiewich commented 3 years ago

Merged this issue and the associated PR (#497) into PR https://github.com/biolink/biolink-model/pull/494 for a coordinated treatment of clinical data modelling.