NCATSTranslator / Evidence-Provenance-Confidence-Working-Group

Properties and linkouts for 'supporting data' items attached to an Edge #7

Open mbrush opened 2 years ago

mbrush commented 2 years ago

At present the Biolink Model provides only a handful of data type-specific edge properties for use by KPs (e.g. p value, chi squared statistic, concept pair count, expected count, ln ratio, ln ratio confidence interval, ...).

In practice, when a specific property is not available for the type of data/score a KP wants to provide, they do one of three things:

  1. Create an unofficial local term for their specific data type to use as their Attribute key:
    attribute_type_id: median_ic50_mut     # not from the Biolink model
    value:  2.5363
  2. Use a more general Biolink edge property they think is the best fit (e.g. 'has evidence', 'has numeric value'), and indicate the more specific type of value the attribute holds using an ontology term in the Attribute.value_type_id field. This violates the intended use of that field, which is to hold a more foundational data type (e.g. curie, url, float); see https://github.com/NCATSTranslator/ReasonerAPI/issues/454.
    attribute_type_id: biolink:has evidence
    value:  0.0443
    value_type_id: OBI_0001191  # IC50

    attribute_type_id: biolink:has numeric value
    value:  0.6352
    value_type_id: STATO_0000085   # effect size
  3. Request that a new term be added to Biolink and use it in the data. Only a few KPs have done this so far, which is why there are only a handful of specific supporting data edge properties in Biolink.
    attribute_type_id: biolink:ln_ratio
    value:  0.0443
    value_type_id: float

mbrush commented 2 years ago

To address these issues, we decided to create a separate edge property for each distinct type of supporting data item or score produced/provided by KPs and ARAs, and to name them simply, so as not to constrain whether they serve as evidence, a quantifying score, or something else . . . just that they somehow support the knowledge expressed in an Edge. e.g.

has supporting data
     p-value 
     effect size 
     ic50     
     z-score 
     relative odds ratio

We have started to collect the set of data types we need to support in the spreadsheet here, and have code in place to automatically generate Biolink edge properties from this content.

Request: KPs and ARAs, please populate this spreadsheet with your required supporting data/score types and clear definitions, so we can provide you with a standard Biolink property to use in your data.
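To help decide which spreadsheet entries are needed, a KP could audit how its existing Attributes encode their data type. The sketch below is a minimal illustration, not part of Biolink or TRAPI tooling: `classify_attribute`, the category labels, and the `GENERAL_PROPERTIES` set are hypothetical names, and the attribute dictionaries simply mirror the examples given earlier in this thread.

```python
# Hypothetical audit helper; names and categories are assumptions for illustration.

# General-purpose Biolink properties named in this thread as common fallbacks.
GENERAL_PROPERTIES = {"biolink:has_evidence", "biolink:has_numeric_value"}


def classify_attribute(attr: dict) -> str:
    """Report which of the three patterns an Attribute follows."""
    # Normalise labels such as "biolink:has evidence" to snake_case CURIEs.
    type_id = attr.get("attribute_type_id", "").strip().replace(" ", "_")
    if not type_id.startswith("biolink:"):
        return "local_term"                 # pattern 1: unofficial local key
    if type_id in GENERAL_PROPERTIES:
        return "general_property"           # pattern 2: generic fallback property
    return "specific_biolink_property"      # pattern 3: dedicated Biolink term


examples = [
    {"attribute_type_id": "median_ic50_mut", "value": 2.5363},
    {"attribute_type_id": "biolink:has evidence", "value": 0.0443,
     "value_type_id": "OBI_0001191"},
    {"attribute_type_id": "biolink:ln_ratio", "value": 0.0443,
     "value_type_id": "float"},
]
labels = [classify_attribute(a) for a in examples]
```

Attributes that come back as `local_term` or `general_property` are the ones most likely to need a new row in the spreadsheet.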

mbrush commented 2 years ago

The other piece of the puzzle is to provide end users with documentation about meaning/generation/utility of these data types - which we could provide in a structured way in the message, or link out to using the Attribute.value_url field. In any case, this will require modeling, content and infrastructure support.

Ultimately, we need the user to understand how a data value hanging from an Edge relates to the knowledge expressed in the Edge. We should also consider whether we want to explore a different paradigm than the 'one edge property per data type' approach put forth initially.

mbrush commented 1 year ago

The Translator / TRAPI landscape was very different nearly a year ago when we decided to create separate edge properties for each data type (see #7). At the time, there was concern about the massive proliferation of edge properties this could lead to in the Biolink Model, but in the absence of precedent for, or clear advantages of, alternatives, we pushed ahead with this approach.

Things have changed since then however, with the introduction of new modeling approaches and paradigms that may have relevance to the representation of supporting data. Specifically:

  1. We have the ability to arbitrarily nest attributes, and have seen nesting used to capture additional information about a top-level attribute, and/or to logically organize supporting data items to enable clearer understanding of their provenance and significance (e.g. using Study Results)

  2. There is precedent for creating custom structures for specific types of 'Attributes'
     a. e.g. 'Qualifier' objects
     b. e.g. 'RetrievalSource' objects to support retrieval provenance (see TRAPI #386)

  3. We have formal support in Biolink for creating and using enumerations to constrain values of fields


We have also gained a better understanding of the diversity/complexity of supporting data information, and how it may be used/useful in Translator. Specifically:

  1. We have a better appreciation of the roles that 'supporting' data items can play in different contexts. These data items can be:
     a. evidence for the statement put forth in an Association (e.g. an IC50 value hanging from a Chemical -affects- Protein edge)
     b. a quantifier of the strength of the Statement made in an Association (e.g. a correlation score on an X correlated with Y edge)
     c. an explicit confidence score that reflects how much we believe the Statement made in an Association to be true
     d. a piece of data reported in a StudyResult that supports/hangs from an Edge (see TMKP and COHD examples here). This data item is likely 'evidence' for the Edge . . . but it is nested in a Study Result object and not directly hung from the Edge.

. . . if we simply have edge properties like 'has_p-value', 'has_z-score', 'has_ic50', etc., we should consider if / how we indicate the role a data item is playing when we attach it to an Association . . . so that users interpret it correctly and in the right context.

  2. We have a UI that can give us a sense of how richer information about supporting data items might be useful, and how this information might be presented to users

None of this existed when we made our initial modeling decisions around supporting data representation. We should revisit those decisions, and consider alternatives that may be more appropriate/relevant and are now feasible to implement in the current landscape.
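One way to make the role of a data item explicit, combining the enumeration support and attribute nesting mentioned above, is sketched below. This is only an illustration under stated assumptions: the `SupportingDataRole` enum, the `example:supporting_data_role` key, and the `biolink:has_ic50` property are hypothetical and not defined in Biolink. Only the nested `attributes` list follows the existing TRAPI Attribute shape.

```python
from enum import Enum
from typing import Optional


class SupportingDataRole(Enum):
    """Hypothetical enumeration of the four roles listed above."""
    EVIDENCE = "evidence"
    STRENGTH_QUANTIFIER = "strength_quantifier"
    CONFIDENCE_SCORE = "confidence_score"
    STUDY_RESULT_DATA = "study_result_data"


ROLE_KEY = "example:supporting_data_role"  # assumed key, not a Biolink term


def with_role(attribute: dict, role: SupportingDataRole) -> dict:
    """Return a copy of the attribute with its role nested as a sub-attribute."""
    nested = list(attribute.get("attributes", []))
    nested.append({"attribute_type_id": ROLE_KEY, "value": role.value})
    return {**attribute, "attributes": nested}


def role_of(attribute: dict) -> Optional[SupportingDataRole]:
    """Read the role back, or None if the attribute carries no role."""
    for sub in attribute.get("attributes", []):
        if sub.get("attribute_type_id") == ROLE_KEY:
            return SupportingDataRole(sub["value"])
    return None


# An IC50 value serving as evidence on a Chemical -affects- Protein edge.
ic50 = {"attribute_type_id": "biolink:has_ic50", "value": 2.5363}
tagged = with_role(ic50, SupportingDataRole.EVIDENCE)
```

Constraining the role to an enumeration keeps the set of allowed values small and documentable, which is what a UI would need in order to explain each role to users.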

mbrush commented 1 year ago

As of now, we have yet to implement more than a handful of specific supporting data properties in Biolink. The list below mainly reflects those created in response to specific KP requests:

General/test properties we implemented long ago

Slots created specifically for the COHD use case

Slots created specifically for the TMKP use case

But I think many KPs are still using bespoke / non-Biolink properties as Attribute keys, and/or using more general Biolink edge properties like has evidence, or has numeric value, and then indicating the value type elsewhere.

In the short term (for the September release) – should we go ahead and create dedicated data-type specific edge properties in Biolink? If so, how do we decide what properties to create . . . the spreadsheet linked in the comment above is likely way out of date. Can we get a new set of KP requests?

A few open questions re defining these properties:

mbrush commented 1 year ago

If we want to continue with the 'one edge property per data type' approach, we need to catalog and create these edge properties, give them clear definitions, and provide examples/documentation that is accessible to users from the UI, so it is clear how a data value hanging from an Edge relates to the knowledge expressed in the Edge.

However, IMO we should consider whether to explore a different paradigm that does not require creation of potentially hundreds of datatype-specific edge properties in the Biolink Model. e.g.:

  1. Leverage the nesting capability of Attributes so that information about the data type and/or its utility/interpretation in the context of an edge can be nested below the data value.
  2. Consider leveraging the SupportingStudy modeling pattern that has been applied to capture supporting data for COHD and TMKP edges, in all cases where we want to report supporting data values. This pattern leverages Attribute nesting to create a structure that contextualizes the data values.
  3. Extend the existing Attribute schema with additional field(s) that can capture the data type and/or its utility/interpretation in the context of an edge.
  4. Define a new structure/object dedicated to representing supporting data (similar to the dedicated structures we have created for Qualifiers and Retrieval Sources, to replace the use of Attributes for this metadata).
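
To make options 1 and 4 more concrete, here is a rough sketch of what a dedicated structure might look like, and how it could be rendered as a nested Attribute. Everything here is an assumption for discussion: the `SupportingDataItem` class, its field names, the `example:` keys, and the single `biolink:has_supporting_data` property are hypothetical, not an agreed schema.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class SupportingDataItem:
    """Hypothetical dedicated structure for one supporting data value (option 4)."""
    data_type: str                         # ontology term, e.g. "OBI_0001191" (IC50)
    value: Union[float, int, str]
    value_type: str = "float"              # foundational type, as value_type_id intends
    interpretation: Optional[str] = None   # free-text guidance for end users

    def to_nested_attribute(self) -> dict:
        """Render as one generic Attribute with the details nested below it (option 1)."""
        nested = [{"attribute_type_id": "example:data_type", "value": self.data_type}]
        if self.interpretation is not None:
            nested.append({"attribute_type_id": "example:interpretation",
                           "value": self.interpretation})
        return {
            "attribute_type_id": "biolink:has_supporting_data",  # assumed property name
            "value": self.value,
            "value_type_id": self.value_type,
            "attributes": nested,
        }


item = SupportingDataItem(data_type="OBI_0001191", value=0.0443,
                          interpretation="Lower values indicate higher potency.")
attribute = item.to_nested_attribute()
```

With this shape, a new data type becomes a new `data_type` value rather than a new edge property, and `value_type_id` goes back to holding only a foundational type.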