MIT License

TranslatorArchitecture

Process

This repository tracks the decision making for the Translator architecture.

This README documents the current strawman architecture. Changes must be made via pull requests. Questions or discussion on topics that do not relate directly to a specific pull request belong in GitHub issues.

Definitions

Architecture Principles

  1. The goal is to create a single integrated product from federated services and data
  2. Which components communicate with one another?
    1. ARS broadcasts query (Message) to one or more ARAs
    2. ARAs respond to ARS with Message
    3. ARA sends query messages to KPs
    4. KPs respond to ARAs with Message
  3. Interfaces:
    1. All communication between the ARS and ARAs conforms to the ReasonerAPI Message spec
    2. KPs can expose their information using these methods:
      1. ReasonerAPI Message
      2. Any SmartAPI-annotated interface
      3. A file dump conforming to KGX standards
    3. The Translator consortium will develop tools to automatically:
      1. proxy ReasonerAPI calls to SmartAPI calls, and
      2. deploy ReasonerAPIs for KGX file dumps.
    4. Subsequent requirements on KPs in this document will specify their application to ReasonerAPI, SmartAPI, and/or KGX interfaces.
  4. Entities in any ReasonerAPI message (ARS/ARA or ARA/KP) or KGX file-based communication are represented using compact URIs (CURIEs), which must be expandable to full IRIs using a JSON-LD context file provided by the biolink-model. A non-ReasonerAPI SmartAPI-registered KP must provide sufficient information in the registry to allow automated conversion of the identifiers of the entities it returns to biolink-model CURIEs.
  5. Node Identifiers
    1. KPs must expose machine readable information about the types of node identifiers that they consume and produce.
    2. ARAs or other integration tools such as KGX will perform node identifier equivalence translations.
    3. The consortium will produce or adopt equivalent id sets, which will be shared across Translator tools. Multiple Translator teams will contribute expertise to these sets, but that expertise will produce centralized results.
    4. SRI will provide tools for disseminating these equivalent identifiers, drawing on the prior work of multiple Translator teams.
    5. The ARS will use these tools to normalize identifiers coming from ARAs before merging results. Normalization will be performed with conflation. Unconflated normalization may be implemented in the future, but it will require a query-time specification of conflation status, which will be passed to the ARAs.
  6. Node Properties
    1. The consortium will produce or adopt a set of node properties. The semantics of these properties will be defined in the biolink model. Multiple Translator teams will contribute expertise to these properties, but that expertise will produce centralized results.
    2. The consortium will provide a central tool for disseminating these node properties.
    3. The ARS will call these tools to provide a consistent set of node properties to the UI and other clients.
  7. Edge Predicates
    1. Relationships between entities (edges) have a predicate indicating the specific type of relationship between the entities.
    2. The biolink model will contain a set of predicates (biolink predicates) used to bridge across pre-existing predicate vocabularies
    3. The biolink model will designate a set of such vocabularies that can be mapped to biolink predicates. These vocabularies are called biolink-mapped.
    4. Predicates in ReasonerAPI messages and KGX files must be biolink predicates.
    5. Non-ReasonerAPI SmartAPI-registered KPs must provide sufficient information via the registry for clients to determine the predicate of each response as an identifier from a biolink-mapped vocabulary.
    6. As a best practice, KPs should map ingested predicates to a biolink-mapped vocabulary as precisely as possible, and rely on tools to convert these predicates into biolink predicates.
    7. The SRI will provide mapping tools to perform this conversion.
  8. ARAs and KPs may both score answers (provide scores in the message); ARAs are required to score answers.
  9. KPs should not call other KPs.
  10. KPs that implement the Translator Reasoner API must perform the following kinds of reasoning in answering queries:
    1. Making identifiers more specific, e.g. responding to a query involving an entity with information related to a subclass of that entity. In the knowledge_graph portion of the response, the more-specific identifier must be present and linked to the less-specific identifier. In the results portion of the response, the more-specific response node will be bound to the less-specific query node.
    2. Making categories in a query more specific, e.g. responding to a query for a biolink:NamedThing with a particular biolink:ChemicalSubstance.
    3. Making predicates more specific, e.g. responding to a query for “affects expression of” with an edge with predicate “increases expression of”. In the response, the more specific edge must occur in the knowledge_graph portion of the response, and in individual results, that more specific edge will be bound to the less specific query edge. Query Graph and Knowledge Graph edges need not match in either predicate or direction to be bound in an answer.
    4. Inverting symmetric predicates, e.g. if the KP contains information that A and B are correlated, then it should respond with that information whether the query is asked in the form A-[correlated_with]->B or B-[correlated_with]->A.
  11. Query Modes:
    1. As described in the TRAPI specification, edges may be queried in either "lookup" or "inferred" mode.
    2. KPs and ARAs must respond to lookup queries by treating the query as an exact database match.
    3. ARAs may respond to inferred-mode one-hops with relevant results beyond an exact database match; KPs may also provide this capability.
    4. When answering an inferred-mode query, a component must also include lookup results.
    5. Inferred-mode queries must be one-hops with a single predicate.
  12. ReasonerAPI best practices:
    1. When an ARA obtains multiple edges with the same subject, predicate, qualifiers, object, and original/primary source from KPs, it should represent these as a single edge in the knowledge_graph component of a ReasonerAPI message.
    2. An ARA or a KP must not combine edges unless they contain the same subject, predicate, qualifiers, object, and original/primary source.
    3. An ARA result is defined by the bindings of knowledge graph nodes to input query graph nodes. For a given set of node bindings, there can be only a single result. Two separate results MUST NOT differ only in their edge bindings, with the same set of node bindings.
  13. ARAs obtain biomedical data only via KPs (or other ARAs), not from locally-cached aggregated graphs or non-Translator data sources.
  14. Aggregated graphs must be created at the consortium level and exposed as a KP.
  15. Components that do not fulfill the responsibilities of KPs and ARAs can still be stand-alone elements of the architecture to provide particular functionality; such tools will use the Translator ReasonerAPI whenever possible.
  16. Answer persistence will be the responsibility of the ARS.
  17. A system-wide UI will (eventually) exist, and will allow users to interpret answers and reformulate questions.
  18. The SmartAPI registry will serve as a Translator Registry, and will expose programmatically accessible metadata about KPs and ARAs.
    1. All REST-Style SmartAPI KPs must be registered in the Translator Registry.
    2. All Translator Reasoner API KPs must be registered in the Translator Registry. All metadata for Translator Reasoner APIs must be available via endpoints at the service, from which it will be extracted by the SmartAPI Registry.
    3. All KGX files intended for graph transfer must be registered in the Translator Registry. All metadata for KGX files must be contained in associated metadata files and exposed via an API, which will be consumed by the SmartAPI Registry.
    4. All ARAs must be registered in the Translator Registry. The ARS will not require a separate registration.
    5. Each type of component must provide the metadata described here
    6. Non-KP, Non-ARA components, such as normalizers, must also be registered and provide metadata appropriate to their API type.
    7. The SmartAPI Registry will provide a unified query system, returning information about all three API methods. This query system will allow ARAs to locate the appropriate KPs.
    8. SRI will guarantee that metadata standards across the components allow such a unified query system.
    9. The SmartAPI registry will allow components to find all KPs by querying for biolink predicates. The SmartAPI registry will also allow components to query by predicates from biolink-mapped vocabularies, and will return the KPs that provide such metadata.
  19. A continuous integration framework will consume metadata from the registry, and provide automated testing and reports.
  20. Both KPs and ARAs should acquire and transmit provenance information to the fullest possible extent.
  21. Queries to KPs and ARAs must use the 'canonical' predicate (as opposed to its inverse), and KPs and ARAs must return the 'canonical' predicate. There will be two ways to identify the 'canonical' predicate in the biolink-model: a canonical predicate will not be tagged with the 'inverse:' attribute, and it will carry an "annotations" flag with the tag "biolink:canonical_predicate" and the value "True". This principle also applies to KGX files and TRAPI messages.
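
Principle 4 requires that CURIEs be expandable to full IRIs through a prefix mapping. The following is a minimal sketch of that expansion, not the consortium's tooling. The `PREFIX_MAP` dictionary and the `expand_curie` function are hypothetical names introduced here for illustration; a real deployment would load the prefix-to-namespace mapping from the JSON-LD context file that the biolink-model provides rather than hard-coding two entries.

```python
# Illustrative prefix map (an assumption for this sketch). In practice the
# mapping would be read from the biolink-model JSON-LD context file.
PREFIX_MAP = {
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
}


def expand_curie(curie: str, prefix_map: dict[str, str]) -> str:
    """Expand a compact URI such as 'MONDO:0005148' into a full IRI."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or prefix not in prefix_map:
        # Refuse to guess: an unknown prefix cannot be expanded.
        raise ValueError(f"cannot expand CURIE: {curie!r}")
    return prefix_map[prefix] + local_id
```

Raising on an unknown prefix, instead of passing the string through, keeps unexpandable identifiers from silently entering a message.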
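
Principle 10.4 asks KPs to answer queries on symmetric predicates regardless of the direction in which the query is written. One way this could look is sketched below, under stated assumptions: edges are flat dictionaries with `subject`, `predicate`, and `object` keys (a simplification, not the TRAPI schema), and `SYMMETRIC_PREDICATES` is a hypothetical hard-coded set standing in for whatever the biolink model declares as symmetric.

```python
# Hypothetical stand-in for the set of predicates the biolink model marks
# as symmetric; a real KP would derive this from the model itself.
SYMMETRIC_PREDICATES = {"biolink:correlated_with"}


def edge_matches(edge: dict, subject: str, predicate: str, obj: str) -> bool:
    """Return True if a stored edge answers the one-hop query subject-[predicate]->obj."""
    if edge["predicate"] != predicate:
        return False
    if (edge["subject"], edge["object"]) == (subject, obj):
        return True
    # For symmetric predicates the stored direction is irrelevant,
    # so the reversed pair also counts as a match.
    return (
        predicate in SYMMETRIC_PREDICATES
        and (edge["subject"], edge["object"]) == (obj, subject)
    )
```

With a stored edge A-[correlated_with]->B, both the A-to-B and the B-to-A form of the query match, while a non-symmetric predicate matches only in its stored direction.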
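
Principles 12.1 and 12.2 define when edges from different KPs may be represented as a single edge: only when subject, predicate, qualifiers, object, and original/primary source all agree. The sketch below illustrates that rule and is not an ARA's actual implementation. The flat edge dictionaries, the `kp` and `primary_source` keys, and the `reported_by` field are assumptions made for this example and do not follow the TRAPI schema.

```python
def edge_key(edge: dict) -> tuple:
    """Build the identity of an edge from the five fields named in principle 12."""
    # Qualifiers are assumed to be a dict; sorting makes the key order-independent.
    qualifiers = tuple(sorted(edge.get("qualifiers", {}).items()))
    return (
        edge["subject"],
        edge["predicate"],
        qualifiers,
        edge["object"],
        edge["primary_source"],
    )


def merge_edges(edges: list[dict]) -> list[dict]:
    """Collapse edges that share an identity; keep all others separate."""
    merged: dict[tuple, dict] = {}
    for edge in edges:
        key = edge_key(edge)
        if key not in merged:
            # Copy every field except the reporting KP, which is tracked in a list.
            merged[key] = {k: v for k, v in edge.items() if k != "kp"}
            merged[key]["reported_by"] = []
        merged[key]["reported_by"].append(edge["kp"])
    return list(merged.values())
```

Two edges that differ only in primary source produce different keys and therefore stay separate, which is the behavior principle 12.2 requires.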
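
Principle 12.3 states that a result is defined by its node bindings, so two results with identical node bindings must be combined rather than returned separately. A minimal sketch of such a merge follows. It assumes a simplified result shape in which `node_bindings` and `edge_bindings` map query graph ids to lists of knowledge graph ids; that shape and the function names are hypothetical and chosen for brevity, not taken from the TRAPI specification.

```python
def node_binding_key(result: dict) -> tuple:
    """Order-independent identity of a result, based only on its node bindings."""
    return tuple(
        sorted(
            (qnode, tuple(sorted(knodes)))
            for qnode, knodes in result["node_bindings"].items()
        )
    )


def merge_results(results: list[dict]) -> list[dict]:
    """Combine results that share node bindings, taking the union of their edge bindings."""
    merged: dict[tuple, dict] = {}
    for result in results:
        key = node_binding_key(result)
        if key not in merged:
            merged[key] = {
                "node_bindings": {q: list(k) for q, k in result["node_bindings"].items()},
                "edge_bindings": {},
            }
        for qedge, kedges in result["edge_bindings"].items():
            bound = merged[key]["edge_bindings"].setdefault(qedge, [])
            for kedge in kedges:
                if kedge not in bound:  # avoid duplicate edge bindings
                    bound.append(kedge)
    return list(merged.values())
```

After merging, no two results differ only in their edge bindings, because every set of node bindings appears exactly once.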

Diagram

image

image