Properties and linkouts for 'supporting data' items attached to an Edge

‘Supporting data’ is broadly defined as information that ‘supports’ the knowledge expressed in an association.
This may include information used as evidence in computation, reasoning, or inference to generate an assertion, or a qualifier/quantifier of its validity/magnitude.
Many diverse data types (>100 - see here, here) are used to support associations in Translator (e.g. p-values, chi-squared statistics, z-scores, relative odds ratios, effect sizes, Kds, edge weights, . . . ).
In some cases, the supporting data is a piece of evidence that was assessed in support of the knowledge expressed in an Association (e.g. an IC50 value from one of many experiments supporting a claim that a chemical inhibits a gene product).
In other cases (e.g. associations directly derived from statistical analysis of data), statistical scores can be considered a quantifier of the magnitude of the observed association (and arguably part of the association semantics), e.g. a Spearman rank correlation value on a gene co-expression association derived from GTEx data - see here.

At present the Biolink Model provides only a handful of data type-specific edge properties for use by KPs (e.g. p value, chi squared statistic, concept pair count, expected count, ln ratio, ln ratio confidence interval, ...).

In practice, when a specific property is not available for the type of data/score a KP wants to provide, they do one of three things:

Create an unofficial local term for their specific data type to use as their Attribute key
```
attribute_type_id: median_ic50_mut     # not from the Biolink model
value:  2.5363
```
Use a more general Biolink edge property they think is the best fit (e.g. has evidence, has numeric value), and indicate the more specific type of value the attribute holds using an ontology term in the Attribute.value_type_id field. This violates the intended use of that field, which is to hold a more foundational data type (e.g. curie, url, float, etc.) - see https://github.com/NCATSTranslator/ReasonerAPI/issues/454

```
attribute_type_id: biolink:has evidence
value:  0.0443
value_type_id: OBI_0001191  # IC50
```

```
attribute_type_id: biolink:has numeric value
value:  0.6352
value_type_id: STATO_0000085   # effect size
```

Request a new term be added to Biolink and use this in the data (only a few KPs have done this so far, which is why there are only a handful of specific supporting data edge properties in Biolink).

```
attribute_type_id: biolink:ln_ratio
value:  0.0443
value_type_id: float
```
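The three patterns above can be told apart mechanically. Here is a minimal sketch, assuming attributes are plain dictionaries keyed by the field names used in the examples. The helper names and the two sets are hypothetical, and the set of foundational types is only the illustrative subset mentioned above (curie, url, float), not an official list.

```python
# Hypothetical helpers for auditing how a KP encoded a supporting data value.
# The two sets below are illustrative assumptions, not official lists.
GENERAL_PROPERTIES = {"biolink:has evidence", "biolink:has numeric value"}
FOUNDATIONAL_TYPES = {"curie", "url", "float"}


def classify_attribute(attribute: dict) -> str:
    """Label which of the three patterns an attribute follows."""
    type_id = attribute["attribute_type_id"]
    if not type_id.startswith("biolink:"):
        return "local term"
    if type_id in GENERAL_PROPERTIES:
        return "general property"
    return "specific property"


def misuses_value_type_id(attribute: dict) -> bool:
    """True when value_type_id holds something other than a foundational type."""
    value_type = attribute.get("value_type_id")
    return value_type is not None and value_type not in FOUNDATIONAL_TYPES


examples = [
    {"attribute_type_id": "median_ic50_mut", "value": 2.5363},
    {"attribute_type_id": "biolink:has evidence", "value": 0.0443,
     "value_type_id": "OBI_0001191"},
    {"attribute_type_id": "biolink:ln_ratio", "value": 0.0443,
     "value_type_id": "float"},
]
labels = [classify_attribute(a) for a in examples]
flags = [misuses_value_type_id(a) for a in examples]
```

A check like this could help survey how often each pattern appears in KP output before deciding which properties to create.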

To address these issues, we decided to create separate edge properties for each distinct type of supporting data item or score produced/provided by KPs and ARAs, and to name them simply so as not to constrain whether they serve as evidence, a quantifying score, or something else . . . just that they somehow support the knowledge expressed in an Edge. e.g.

has supporting data
     p-value 
     effect size 
     ic50     
     z-score 
     relative odds ratio

We have started to collect the set of data types we need to support in the spreadsheet here, and have code in place to automatically generate Biolink edge properties from this content.
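For illustration only, generating properties from spreadsheet content might look roughly like the sketch below. This is not the actual generation code; the column names (`name`, `definition`), the function name, and the sample rows are assumptions, and the output is just a dictionary shaped like slot definitions grouped under the 'has supporting data' parent.

```python
import csv
import io

# Stand-in for a CSV export of the spreadsheet; column names and rows are
# assumptions made for this sketch.
SHEET = """name,definition
p-value,Probability of a result at least as extreme under the null hypothesis
ic50,Concentration of an inhibitor that halves a measured activity
"""


def rows_to_slots(csv_text: str, parent: str = "has supporting data") -> dict:
    """Turn spreadsheet rows into slot-like definitions under a common parent."""
    slots = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["name"].strip()
        slots[name] = {
            "is_a": parent,
            "description": row["definition"].strip(),
        }
    return slots


slots = rows_to_slots(SHEET)
```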

Request: KPs and ARAs, please populate this spreadsheet with your required supporting data/score types and clear definitions, so we can provide you with a standard Biolink property to use in your data.

The other piece of the puzzle is to provide end users with documentation about the meaning/generation/utility of these data types, which we could provide in a structured way in the message, or link out to using the Attribute.value_url field. In any case, this will require modeling, content, and infrastructure support.
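As a rough sketch of the linkout option, assuming attributes are plain dictionaries: the helper name and the documentation URL below are hypothetical placeholders, and where such documentation would actually live is an open question.

```python
def with_documentation(attribute: dict, doc_url: str) -> dict:
    """Return a copy of the attribute carrying a linkout in value_url."""
    documented = dict(attribute)  # shallow copy; the input is left unchanged
    documented["value_url"] = doc_url
    return documented


p_value = {
    "attribute_type_id": "biolink:p_value",
    "value": 0.0443,
    "value_type_id": "float",
}
# Placeholder URL, not a real documentation page.
documented = with_documentation(p_value, "https://example.org/docs/p-value")
```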

Ultimately, we need the user to understand how a data value hanging from an Edge relates to the knowledge expressed in the Edge. We should also consider whether we want to explore a different paradigm than the 'one edge property per data type' approach put forth initially.

The Translator / TRAPI landscape was very different nearly a year ago when we decided to create separate edge properties for each data type (see #7). At the time, there was concern about the massive proliferation of edge properties this could lead to in the Biolink Model, but in the absence of precedent for, or clear advantages of, alternatives, we pushed ahead with this approach.

Things have changed since then, however, with the introduction of new modeling approaches and paradigms that may have relevance to the representation of supporting data. Specifically:

We have the ability to arbitrarily nest attributes, and have seen this used to capture additional information about a top-level attribute, and/or to logically organize supporting data items to enable clearer understanding of their provenance and significance (e.g. using Study Results)
There is precedent for creating custom structures for specific types of 'Attributes':
   a. e.g. 'Qualifier' objects
   b. e.g. 'RetrievalSource' objects to support retrieval provenance (see TRAPI #386)
We have formal support in Biolink for creating and using enumerations to constrain values of fields
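To make the nesting point concrete, here is a small sketch assuming attributes are dictionaries whose optional `attributes` list holds nested attributes. The specific identifiers and values are illustrative placeholders, and `walk` is a hypothetical helper.

```python
def walk(attribute: dict, depth: int = 0):
    """Yield (depth, attribute_type_id, value) for an attribute and its children."""
    yield depth, attribute["attribute_type_id"], attribute.get("value")
    for child in attribute.get("attributes", []):
        yield from walk(child, depth + 1)


# A study-result-style attribute that groups the data items it reports.
# Identifiers and values here are placeholders for illustration.
study_result = {
    "attribute_type_id": "biolink:has_supporting_study_result",
    "value": "example:study_result_1",
    "attributes": [
        {"attribute_type_id": "biolink:p_value", "value": 0.0443},
        {"attribute_type_id": "biolink:ln_ratio", "value": 0.6352},
    ],
}
flat = list(walk(study_result))
```

Walking the structure this way keeps each data value tied to the parent that contextualizes it, which is the property a UI would need to present provenance clearly.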

We have also gained a better understanding of the diversity/complexity of supporting data information, and how it may be used/useful in Translator. Specifically:

We have a better appreciation of the roles that 'supporting' data items can play in different contexts. These data items can be:
   a. evidence for the statement put forth in an Association (e.g. an IC50 value hanging from a Chemical -affects- Protein edge)
   b. a quantifier of the strength of the Statement made in an Association (e.g. a correlation score on an X correlated with Y edge)
   c. an explicit confidence score that reflects how much we believe the Statement made in an Association to be true
   d. a piece of data reported in a StudyResult that supports/hangs from an Edge (see TMKP and COHD examples here). This data item is likely 'evidence' for the Edge . . . but it is nested in a Study Result object and not directly hung from the Edge.

. . . if we simply have edge properties like 'has_p-value', 'has_z-score', 'has_ic50', etc., we should consider if/how we indicate the role a data item is playing when we attach it to an Association, so that users interpret it correctly and in the right context.

We have a UI that can give us a sense of how richer information about supporting data items might be useful, and how this information might be presented to users.

None of this existed when we made our initial modeling decisions around supporting data representation. We should revisit this decision, and consider alternatives that may be more appropriate/relevant and are now feasible to implement in the current landscape.
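One possible way to make the role of a data item explicit is to combine an enumeration of the roles listed above with a nested attribute. This is only a sketch: the enum, its values, the `example:` identifiers, and the `tag_role` helper are all hypothetical and not part of Biolink or TRAPI.

```python
from enum import Enum


class SupportingDataRole(Enum):
    """Hypothetical enumeration of the roles a supporting data item can play."""
    EVIDENCE = "evidence"
    QUANTIFIER = "quantifier"
    CONFIDENCE_SCORE = "confidence score"
    STUDY_RESULT_DATA = "study result data"


def tag_role(attribute: dict, role: SupportingDataRole) -> dict:
    """Return a copy of the attribute with its role recorded as a nested attribute."""
    tagged = dict(attribute)
    nested = list(attribute.get("attributes", []))
    nested.append({
        "attribute_type_id": "example:supporting_data_role",  # placeholder key
        "value": role.value,
    })
    tagged["attributes"] = nested
    return tagged


ic50 = {"attribute_type_id": "example:ic50", "value": 2.5363}
tagged = tag_role(ic50, SupportingDataRole.EVIDENCE)
```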

As of now, we have yet to implement more than a handful of specific supporting data properties in Biolink. The list below mainly reflects those created in response to specific KP requests:

General/test properties we implemented long ago

p value
chi squared statistic

Slots created specifically for the COHD use case

concept count object
concept count subject
concept pair count
expected count
ln ratio
ln ratio confidence interval
relative frequency object
relative frequency object confidence interval
relative frequency subject
relative frequency subject confidence interval

Slots created specifically for the TMKP use case

subject location in text
object location in text
supporting text
supporting text section type
supporting document type
supporting document year
extraction confidence score

But I think many KPs are still using bespoke / non-Biolink properties as Attribute keys, and/or using more general Biolink edge properties like has evidence, or has numeric value, and then indicating the value type elsewhere.

In the short term (for the September release) – should we go ahead and create dedicated data-type specific edge properties in Biolink? If so, how do we decide what properties to create . . . the spreadsheet linked in the comment above is likely way out of date. Can we get a new set of KP requests?

A few open questions re defining these properties:

Naming convention for these properties (verbify with 'has_'? ‘supporting’ prefix?)
Placement of these properties in the Biolink model / edge property hierarchy (proposal: group under a common parent, e.g. create a supporting data parent property?)
Relationship of these properties to the existing has evidence property?

If we want to continue with the 'one edge property per data type' approach, we need to catalog and create these edge properties, give them clear definitions, and provide examples/documentation that is accessible to users from the UI, so it is clear how a data value hanging from an Edge relates to the knowledge expressed in the Edge.

However, IMO we should consider if we want to explore a different paradigm that does not require creation of potentially hundreds of datatype-specific edge properties in the Biolink Model. e.g.:

Leverage the nesting capability of Attributes so that information about the data type and/or its utility/interpretation in the context of an edge can be nested below the data value.
Consider leveraging the SupportingStudy modeling pattern that has been applied to capture supporting data for COHD and TMKP edges, in all cases where we want to report supporting data values. This pattern leverages Attribute nesting to create a structure that contextualizes the data values.
Extend the existing Attribute schema with additional field(s) that can capture the data type and/or its utility/interpretation in the context of an edge.
Define a new structure/object dedicated to representing supporting data (similar to the dedicated structures we have created for Qualifiers and Retrieval Sources, to replace the use of Attributes for this metadata).
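To give a feel for the last option, a dedicated structure could look something like the sketch below. Every name here (`SupportingData`, its fields, the `supporting_data` edge key, and the `example:` identifiers) is hypothetical; nothing like this exists in TRAPI today, and the field set would need to come out of the modeling discussion.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SupportingData:
    """Hypothetical dedicated object for a supporting data item on an Edge."""
    data_type: str                           # e.g. an ontology term for the kind of score
    value: float
    role: str                                # e.g. "evidence", "quantifier"
    documentation_url: Optional[str] = None  # optional linkout for end users


# Placeholder edge; 'supporting_data' is an assumed key, not a TRAPI field.
edge = {
    "subject": "example:chemical_1",
    "predicate": "biolink:affects",
    "object": "example:protein_1",
    "supporting_data": [
        asdict(SupportingData("OBI_0001191", 0.0443, "evidence")),
    ],
}
```

Compared with one edge property per data type, this keeps the number of Biolink properties fixed and moves the data type into a value, at the cost of a schema change.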

NCATSTranslator / Evidence-Provenance-Confidence-Working-Group

Properties and linkouts for 'supporting data' items attached to an Edge #7