Feature requests for MetaKG endpoints

mbrush commented 2 years ago

The metaKG endpoints provided by each KP report what 'edge types' (meta-edges) the KP provides (an 'edge type' is defined by a unique combination of subject category, object category, and predicate). Service Provider collects and standardizes these reports to provide a comprehensive 'MetaKG' (capital 'M') endpoint that spans all of Translator. This resource has the potential to support a variety of important use cases that will facilitate Translator development and use:

1. Query Federation and Knowledge Discovery: This is why metaKGs were created in the first place - to help ARAs know which KPs have what types of knowledge, so they can send queries to the right ones, and find the specific kinds of knowledge they need for analysis/reasoning tasks and workflows.

2. QA efforts: help identify meta-edge types that violate domain/range constraints (e.g. assert things like 'Small Molecule has_input Protein'), or are scientifically invalid/nonsensical (e.g. 'ChemicalEntity predisposes Gene') - and report them back to the owning KP to fix.

3. Guide Modeling Work and Priorities: As the modeling team expands support for new knowledge types, and refactors existing representations to accommodate qualifier-based representations, we need to understand what KPs are producing. This will help us:
a. identify which subdomains are highest priority
b. ensure modeling proposals cover all data/use cases
c. consult relevant KPs for input and feedback
d. work with KPs to help them migrate their data to new structures

4. Data Gap Analysis: a comprehensive metaKG will allow us to survey what types of knowledge our resources cover. We can then identify areas where we may be missing sufficient data to support priority queries/use cases.

5. End User Documentation: the information in the metaKG can help us produce summary statistics about the types and abundance of knowledge Translator holds, which can help end users determine how it can be useful, and devise strategies for applying it to their questions/use cases.
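The QA check described in use case 2 above could be sketched roughly as follows. This is a minimal illustration, not an existing Translator tool: `ANCESTORS` and `PREDICATE_CONSTRAINTS` are hypothetical, hand-written stand-ins for the category hierarchy and domain/range information that would in practice be read from the Biolink Model, and the constraint shown for `biolink:has_input` is assumed for the example.

```python
from typing import NamedTuple, Optional


class EdgeType(NamedTuple):
    """A meta-edge: one unique subject category / predicate / object category."""
    subject: str
    predicate: str
    object: str


# Hypothetical excerpt of a category hierarchy: category -> set of ancestors.
ANCESTORS = {
    "biolink:SmallMolecule": {"biolink:MolecularEntity", "biolink:ChemicalEntity"},
    "biolink:MolecularActivity": {"biolink:BiologicalProcessOrActivity"},
}

# Illustrative constraints only: predicate -> (domain, range); None = unconstrained.
PREDICATE_CONSTRAINTS = {
    "biolink:has_input": ("biolink:BiologicalProcessOrActivity", None),
}


def is_a(category: str, ancestor: Optional[str]) -> bool:
    """True if `category` is `ancestor` or one of its listed descendants."""
    return category == ancestor or ancestor in ANCESTORS.get(category, set())


def violations(edge_types):
    """Return (edge_type, reason) pairs for edge types that break a constraint."""
    found = []
    for et in edge_types:
        constraint = PREDICATE_CONSTRAINTS.get(et.predicate)
        if constraint is None:
            continue  # no constraint recorded for this predicate
        domain, range_ = constraint
        if domain and not is_a(et.subject, domain):
            found.append((et, f"subject {et.subject} is outside domain {domain}"))
        if range_ and not is_a(et.object, range_):
            found.append((et, f"object {et.object} is outside range {range_}"))
    return found
```

Running `violations` over the edge types in a MetaKG dump would flag a row such as 'Small Molecule has_input Protein' under these assumed constraints; scientifically nonsensical but formally valid rows would still need human review.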

The Biolink/SRI team has identified a few new features of the MetaKG endpoint/model that would improve support for these use cases (in particular, supporting our modeling work as we explore refactoring to a qualifier-based approach).

1. Support for including what subject/object and statement level qualifiers are used with a given SPO edge type (and the values of the qualifiers that are used).
a. With the ongoing refactor, there will be a significant reduction in number/granularity of predicates - and much of the semantics of the model/data will move from predicates to qualifiers.
b. Specifying a 'type' of edge/statement at the granularity/resolution currently provided by metaKGs will now require more than indicating the subject and object categories and predicate used - qualifier types and values will need to be reported as well to express the full semantics of an 'edge type'. This will require an expansion of the 'columns' in the metaKG report/model - and a means to extract, summarize, and report qualifier values from KP edges.
c. @edeutsch and/or @vdancik may have drafted a proposal for this somewhere - please link here.

2. Include a list of sources that provide a given edge type (could simply be a list of infores curies).
a. Minimally we would need to see a list of Translator KPs that provide each reported edge type. But if possible we would like to see a list of primary sources as well.
b. One reason for this is to help report invalid edge types back to the KPs that are providing them, so they can be fixed (e.g. in the Chem Gene space almost half of the 1400 edge types described in metaKG results are semantically invalid in that they violate domain/range constraints (e.g. assert things like 'Small Molecule has_input Protein'), or are scientifically invalid/nonsensical (e.g. 'ChemicalEntity predisposes Gene')).

3. Include a count of how many instances of a reported edge type exist in the data.
a. Helps us prioritize which data/sources to address, and which problems to fix first.

4. Include a list of attribute types found across all instances of a given edge type.

5. Report on edge types 'rolled up' to specific higher-level node categories.
a. e.g. show me Chemical Entity-Gene or Gene Product edge types, but report at the level of Chemical Entity and Gene or Gene Product, for cases where we don't need to see rows for each predicate for every combination of subtypes of these node categories.
b. It may be best to implement some requirements via post-processing - maybe load the results of a MetaKG dump into Solr - so we can do roll-up based on hierarchies of categories, etc.
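Requests 1 to 4 above amount to grouping a KP's edges by an extended key (subject category, predicate, object category, qualifiers) and recording a few aggregates per group; request 5 is a second grouping pass over category ancestors. The sketch below is one hedged way to picture that, not a proposed TRAPI schema change. The flat input edge dictionaries, the field names (`qualifiers`, `sources`, `attribute_types`), and the `parent_of` map are all assumptions made for illustration.

```python
from collections import defaultdict


def _empty_row():
    return {"count": 0, "sources": set(), "attribute_types": set()}


def summarize(edges):
    """Group edges into extended meta-edge rows with counts, sources, attribute types."""
    rows = defaultdict(_empty_row)
    for e in edges:
        # Qualifier types and values become part of the edge-type key (request 1).
        qualifiers = tuple(sorted(e.get("qualifiers", {}).items()))
        key = (e["subject_category"], e["predicate"], e["object_category"], qualifiers)
        row = rows[key]
        row["count"] += 1                                            # request 3
        row["sources"].update(e.get("sources", []))                  # request 2
        row["attribute_types"].update(e.get("attribute_types", []))  # request 4
    return dict(rows)


def lift(category, parent_of, targets):
    """Walk up the (assumed acyclic) hierarchy to the nearest category in `targets`."""
    current = category
    while current is not None and current not in targets:
        current = parent_of.get(current)
    return category if current is None else current


def roll_up(rows, parent_of, targets):
    """Merge rows whose categories share an ancestor in `targets` (request 5)."""
    merged = defaultdict(_empty_row)
    for (subj, pred, obj, quals), row in rows.items():
        key = (lift(subj, parent_of, targets), pred, lift(obj, parent_of, targets), quals)
        merged[key]["count"] += row["count"]
        merged[key]["sources"] |= row["sources"]
        merged[key]["attribute_types"] |= row["attribute_types"]
    return dict(merged)
```

Doing the roll-up as a separate pass over already-summarized rows matches the post-processing suggestion in 5b: the KP-facing report stays at full resolution, and coarser views are derived from it.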

Tagging @sierra-moxon and @edeutsch

sierra-moxon commented 2 years ago

Similarly, I noticed that in the current specification MetaNode attributes are optional:
https://github.com/NCATSTranslator/ReasonerAPI/blob/91338a117354385d58ce0b904fb70a095fc501dd/TranslatorReasonerAPI.yaml#L977

and the same is true of MetaEdge.attributes: https://github.com/NCATSTranslator/ReasonerAPI/blob/91338a117354385d58ce0b904fb70a095fc501dd/TranslatorReasonerAPI.yaml#L1018

It would be really helpful for the modeling team to see the kinds of attributes that are assigned for both nodes and edges. It is probably hard to make these required (I can imagine use cases where there are no attributes). Do we interpret the lack of attributes returned as "no attributes available"?
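To make the ambiguity concrete: a consumer can only tell 'attributes not reported' apart from 'no attributes exist' if the two cases are encoded differently. The sketch below assumes a convention in which a missing or null `attributes` field means 'not reported' and an empty list means 'reported, none available'. That convention is an assumption for illustration only, not something the current specification states, and the `attribute_type_id` key on each listed attribute is likewise assumed here.

```python
def attribute_status(meta_item):
    """Classify a MetaNode/MetaEdge dict by how its attributes are reported."""
    attrs = meta_item.get("attributes")
    if attrs is None:
        # Field missing or null: the KP said nothing about attributes.
        return "not reported"
    if not attrs:
        # Explicit empty list: the KP asserts there are no attributes.
        return "none available"
    type_ids = sorted(a["attribute_type_id"] for a in attrs)
    return "reported: " + ", ".join(type_ids)
```

Under this assumed convention, the answer to the question above would be 'no': a missing field would not be read as 'no attributes available'.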

RichardBruskiewich commented 2 years ago

Would resolution of this use case also consider the formal publication of the category of an Edge? Edges have had a category argument for some time now, which is generally set to the specific child subclass of biolink:Association that constrains the overall semantics of the statement (I think this is more or less what @mbrush is calling 'edge type' above). Such Edge category values may prove invaluable in managing (and validating) the emerging world of Biolink Model 3.0 qualifiers.

NCATSTranslator / ReasonerAPI

Feature requests for MetaKG endpoints #342