Open mbrush opened 2 years ago
To address these issues, we decided to create separate edge properties for each distinct type of supporting data item or score produced/provided by KPs and ARAs, and to name them simply, so as not to constrain whether they serve as evidence, a quantifying score, or something else. The names indicate only that the data somehow support the knowledge expressed in an Edge. e.g.
has supporting data
p-value
effect size
ic50
z-score
relative odds ratio
We have started to collect the set of data types we need to support in the spreadsheet here, and have code in place to automatically generate Biolink edge properties from this content.
Request: KPs and ARAs populate this spreadsheet with your required supporting data/score types, and clear definitions, so we can provide you with a standard Biolink property to use in your data.
The other piece of the puzzle is to provide end users with documentation about meaning/generation/utility of these data types - which we could provide in a structured way in the message, or link out to using the Attribute.value_url field. In any case, this will require modeling, content and infrastructure support.
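As a rough illustration of the 'one edge property per data type' approach, the sketch below builds a TRAPI-style attribute for a p-value and links out to documentation through value_url. The field names follow the TRAPI Attribute schema; the helper name `make_attribute`, the `infores:example-kp` source, the `linkml:Float` value type, and the documentation URL are placeholders assumed for this example, not agreed conventions.

```python
from typing import Any, Dict, Optional


def make_attribute(
    attribute_type_id: str,
    value: Any,
    value_type_id: Optional[str] = None,
    value_url: Optional[str] = None,
    attribute_source: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a TRAPI-style Attribute as a plain dict, omitting unset fields."""
    attribute: Dict[str, Any] = {
        "attribute_type_id": attribute_type_id,
        "value": value,
    }
    optional_fields = {
        "value_type_id": value_type_id,
        "value_url": value_url,
        "attribute_source": attribute_source,
    }
    attribute.update({k: v for k, v in optional_fields.items() if v is not None})
    return attribute


# Dedicated Biolink property as the attribute key; all other values are hypothetical.
p_value_attribute = make_attribute(
    "biolink:p_value",
    0.003,
    value_type_id="linkml:Float",                # assumed foundational type label
    value_url="https://example.org/docs/p-value",  # placeholder documentation link
    attribute_source="infores:example-kp",       # hypothetical KP identifier
)
```

With a dedicated property, the attribute key alone tells a consumer what kind of value it holds, and value_url is free to point at user-facing documentation.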
Ultimately, we need the user to understand how a data value hanging from an Edge relates to the knowledge expressed in the Edge. We should also consider whether we want to explore a different paradigm than the 'one edge property per data type' approach put forth initially.
The Translator / TRAPI landscape was very different nearly a year ago when we decided to create separate edge properties for each data type (see #7). At the time, there was concern about the massive proliferation of edge properties this could lead to in the Biolink Model - but in the absence of precedent for, or clear advantages of, alternatives, we pushed ahead with this approach.
Things have changed since then, however, with the introduction of new modeling approaches and paradigms that may be relevant to the representation of supporting data. Specifically:
We have the ability to arbitrarily nest attributes, and have seen them used to capture additional information about a top-level attribute, and/or to logically organize supporting data items to enable clearer understanding of their provenance and significance (e.g. using Study Results)
There is precedent for creating custom structures for specific types of 'Attributes', e.g. 'Qualifier' objects, and 'RetrievalSource' objects to support retrieval provenance (see TRAPI #386)
We have formal support in Biolink for creating and using enumerations to constrain values of fields
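To make the nesting option concrete, here is a minimal sketch of a top-level attribute that groups supporting data items as sub-attributes, plus a small walker that lists each item with its path. The nested `attributes` field is part of the TRAPI Attribute schema; the `EXAMPLE:` identifiers and the `walk` helper are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Iterator, Tuple

# A hypothetical study result that organizes two supporting data items.
study_result: Dict[str, Any] = {
    "attribute_type_id": "EXAMPLE:study_result",  # placeholder identifier
    "value": "EXAMPLE:study-result-1",            # placeholder identifier
    "attributes": [
        {"attribute_type_id": "biolink:p_value", "value": 0.003},
        {"attribute_type_id": "EXAMPLE:effect_size", "value": 1.8},
    ],
}


def walk(
    attribute: Dict[str, Any], path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Yield (path, value) for an attribute and each nested attribute, depth-first."""
    here = path + (attribute["attribute_type_id"],)
    yield here, attribute["value"]
    for child in attribute.get("attributes") or []:
        yield from walk(child, here)


for attribute_path, attribute_value in walk(study_result):
    print(" > ".join(attribute_path), "=", attribute_value)
```

The path makes explicit which parent a data item belongs to, which is the provenance context a flat list of edge properties does not carry.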
We have also gained a better understanding of the diversity/complexity of supporting data information, and how it may be used/useful in Translator. Specifically:
. . . if we simply have edge properties like 'has_p-value', 'has_z-score', 'has_ic50', etc., we need to consider if / how we indicate the role a data item is playing when we attach it to an Association, so that users interpret it correctly and in the right context
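One way to picture an explicit role is an enumeration attached to each data item. The sketch below is purely hypothetical: the `SupportRole` and `SupportingDatum` names are invented here, and the role values are taken only from the roles mentioned earlier in this thread (evidence, a quantifying score, or something else). It is not a proposed Biolink structure.

```python
from dataclasses import dataclass
from enum import Enum


class SupportRole(Enum):
    """Hypothetical enumeration of roles a data item can play on an edge."""

    EVIDENCE = "evidence"
    QUANTIFYING_SCORE = "quantifying score"
    OTHER = "other"


@dataclass(frozen=True)
class SupportingDatum:
    """A data value plus its type and the role it plays (illustrative only)."""

    data_type: str
    value: float
    role: SupportRole

    def describe(self) -> str:
        return f"{self.data_type} = {self.value} ({self.role.value})"


datum = SupportingDatum("p-value", 0.003, SupportRole.EVIDENCE)
print(datum.describe())
```

Because the role is an enumeration, an unrecognized role string fails at construction time (`SupportRole("unknown")` raises `ValueError`) rather than passing through silently.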
None of this existed when we made our initial modeling decisions around supporting data representation. We should revisit those decisions, and consider alternatives that may be more appropriate/relevant and are now feasible to implement in the current landscape.
As of now, we have yet to implement more than a handful of specific supporting data properties in Biolink. The list below mainly reflects those created in response to specific KP requests:
General/test properties we implemented long ago
Slots created specifically for the COHD use case
Slots created specifically for the TMKP use case
But I think many KPs are still using bespoke / non-Biolink properties as Attribute keys, and/or using more general Biolink edge properties like 'has evidence' or 'has numeric value', and then indicating the value type elsewhere.
In the short term (for the September release) – should we go ahead and create dedicated data-type specific edge properties in Biolink? If so, how do we decide what properties to create . . . the spreadsheet linked in the comment above is likely way out of date. Can we get a new set of KP requests?
A few open questions re defining these properties:
supporting data
parent property?)has evidence
property?
If we want to continue with the 'one edge property per data type' approach, we need to catalog and create these edge properties, give them clear definitions, and provide examples/documentation that is accessible to users from the UI, so it is clear how a data value hanging from an Edge relates to the knowledge expressed in the Edge.
However, IMO we should consider whether we want to explore a different paradigm that does not require creation of potentially hundreds of datatype-specific edge properties in the Biolink Model. e.g.:
At present the Biolink Model provides only a handful of data type-specific edge properties for use by KPs (e.g. 'p value', 'chi squared statistic', 'concept pair count', 'expected count', 'ln ratio', 'ln ratio confidence interval', ...).
In practice, when a specific property is not available for the type of data/score a KP wants to provide, they do one of three things:
Use a more general Biolink property (e.g. 'has evidence', 'has numeric value'), and indicate the more specific type of value the attribute holds using an ontology term in the Attribute.value_type_id field (which violates the intended use of this field to hold a more foundational data type, e.g. curie, url, float, etc. - see https://github.com/NCATSTranslator/ReasonerAPI/issues/454)
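The difference between this workaround and a dedicated property can be sketched as follows. This is a minimal illustration only: the set of foundational type labels simply repeats the examples named above (curie, url, float), and the `EXAMPLE:z_score` identifier and the helper name are hypothetical.

```python
from typing import Any, Dict

# Assumed labels for "foundational" data types, taken from the examples in the text.
FOUNDATIONAL_TYPES = {"curie", "url", "float"}


def uses_value_type_for_semantics(attribute: Dict[str, Any]) -> bool:
    """Return True when value_type_id holds something other than a foundational type."""
    value_type = attribute.get("value_type_id")
    return value_type is not None and value_type not in FOUNDATIONAL_TYPES


# Workaround: general property as the key, specific meaning pushed into value_type_id.
workaround = {
    "attribute_type_id": "biolink:has_numeric_value",
    "value": 2.5,
    "value_type_id": "EXAMPLE:z_score",  # hypothetical ontology term
}

# Dedicated property as the key, value_type_id left to describe the raw data type.
dedicated = {
    "attribute_type_id": "EXAMPLE:z_score",  # hypothetical dedicated property
    "value": 2.5,
    "value_type_id": "float",
}

print(uses_value_type_for_semantics(workaround))
print(uses_value_type_for_semantics(dedicated))
```

A check along these lines could flag attributes where the meaning of the value has to be recovered from value_type_id instead of from the attribute key.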