TranslatorSRI / SRI_testing

MIT License
0 stars 1 forks source link

Specify Biolink Model 3 (qualifier aware) KP test data format #60

Open RichardBruskiewich opened 1 year ago

RichardBruskiewich commented 1 year ago

Biolink Model 3.0 introduced a new 'qualifier' framework for knowledge representation and TRAPI query. This new framework will require additional KP test edge information for testing, thus, a new KP test data format is needed, perhaps more closely mirroring the TRAPI 1.3 support for this new semantic space. We also need to consider filtering on existing attribute content. Further ahead, constraining test edges to match a specified subclass of biolink:Association may facilitate validation, including qualifier (secondary statement level) validation.

RichardBruskiewich commented 1 year ago

The biolink model sensitive portions of the current KP JSON test configuration file are the test edge data that look something like this:

{
...
    "edges": [
       {
            "subject_category": "biolink:SmallMolecule",  
            "object_category": "biolink:Disease",
            "predicate": "biolink:treats",
            "subject": "CHEBI:3002",     # beclomethasone dipropionate
            "object": "MESH:D001249"     # asthma
        },
        ...other test edges
   ]
}

The general strategy to fix the format to align with Biolink Model 3.0++ is perhaps to ensure that test edge data closely mirror the corresponding TRAPI 1.3++ schema data model:

  1. Specify subject and object categories: already specified by "subject_category" and "object_category" json object keys respectively.
  2. Specify subject and object identifiers: already specified by "subject" and "object" json object keys respectively. Would it be reasonable, however, to rename these to "subject_id" and "object_id" to be more precisely consistent with the actual naming of the corresponding TRAPI fields?
  3. Specify filters on edge qualifiers constraints: this a new test data constraint we need to add, although optional, since they may not be needed or supported by all KP's or needed for most knowledge graphs returned by TRAPI. To implement this, we can look at the TRAPI schemata (see Qualifier Constraints below).
  4. Specify filters edge attributes constraints: the SRI Testing harness already attempts generic validation on attributes with a special focus on knowledge source provenance constraints; however, the current test edge data doesn't explicitly filter on any other attribute content. To implement this (as an optional filter), we can look at the TRAPI schemata (see Attribute Constraints below)
  5. Specify biolink:association-based class constraint: although still to be fully discussed with the TRAPI Working Group and possibly hosted in TRAPI 1.4 (not 1.3), the possibility may be made of attempting a "soft" introduction leveraging TRAPI 1.3 Attribute Constraints. The main rationale here is that knowing what association category is expected can provide solid guidance for semantic validation of the edge, for example, the range (or subproperty space) of acceptable values for its subject category, predicate and object category, and possibly, what edge qualifiers are expected or appropriate for the edge (see Association Constraint below).

Qualifier Constraints

The TRAPI model (excerpt) for query graph (QEdge) qualifier specification is:

    QEdge:
      type: object
...
      properties:
...
        qualifier_constraints:
          type: array
          description: >-
            A list of QualifierConstraints that provide nuance to the QEdge.
            If multiple QualifierConstraints are provided, there is an OR
            relationship between them. If the QEdge has multiple
            predicates or if the QNodes that correspond to the subject or
            object of this QEdge have multiple categories or multiple
            curies, then qualifier_constraints MUST NOT be specified
            because these complex use cases are not supported at this time.
          items:
            $ref: '#/components/schemas/QualifierConstraint'
          default: []

where QualifierConstraint is:

    QualifierConstraint:
      additionalProperties: false
      description: >-
...
      properties:
        qualifier_set:
          type: array
          description: >-
            A set of Qualifiers that serves to add nuance to a query,
            by constraining allowed values held by Qualifiers
            on queried Edges.
          items:
            $ref: '#/components/schemas/Qualifier'
          nullable: false

and Qualifier is:

    Qualifier:
      additionalProperties: false
      description: >-
        An additional nuance attached to an assertion
      type: object
      properties:
        qualifier_type_id:
          type: string
          description: >-
...
          nullable: false
        qualifier_value:
          type: string
          description: >-
...
          nullable: false
      required:
        - qualifier_type_id
        - qualifier_value

This suggests one direct specification of qualifiers in the test data edge might look something like this:

"qualifier_constraints": [
   {
        "qualifier_set": [
              {
                   "qualifier_type_id": "biolink:causal_mechanism_qualifier"
                   "qualifier_value": "agonist"
              },
              {
                   "qualifier_type_id": "biolink:catalyst_qualifier"
                   "qualifier_value": "CHEBI:63951"   # estrogen receptor agonist
              }
        ]
   },
...other qualifier constraint sets?
]

Assuming (perhaps foolishly?) that qualifiers can only be specified once in a query, then perhaps the following alternate direct compact representation of the qualifier_type_id = qualifier_value pair could be envisioned:

"qualifier_constraints": {
        "biolink:causal_mechanism_qualifier":  "inhibition",
...other qualifier type_id:value pairs?
  }

But this doesn't allow for duplicate keys, so perhaps we use the full TRAPI syntax for a single Qualifier entry.

On the other hand, we pose the question whether or not test edge input data ought (or need) to have complex multi-qualifier constraints, namely, arrays of multiple QualifierConstraint qualifier_set objects, each containing an array with multiple Qualifier objects. It is an open question about how complex the test edge should be allowed to be, i.e. how many qualifier_set's are specified. Expectations should be realistic here among test specifiers!

An initial implementation of test edge data format supporting qualifiers could limit itself to one qualifier, for example:

    "edges": [
        {
            "subject_category": etc...
            "qualifier": {
                      "qualifier_type_id": "biolink:causal_mechanism_qualifier"
                      "qualifier_value": "inhibition"
            }
        }  # end of test edge spec
... more test edges
]

A second generation implementation of test edge data format supporting qualifiers could allow a single qualifier_set, for example:

    "edges": [
       {
            "subject_category": etc...
            "qualifier_set": [
                  {
                       "qualifier_type_id": "biolink:causal_mechanism_qualifier"
                       "qualifier_value": "inhibition"
                  },
                  {
                       "qualifier_type_id": "biolink:causal_mechanism_qualifier"
                       "qualifier_value": "inhibition"
                  }
            ]
     }  # end of test edge spec
... more test edges
]

A third iteration, maybe additional qualifier sets:

    "edges": [
       {
            "subject_category": etc...
            "qualifier_constraints": [
                  {
                     "qualifier_set": [
                           {
                                "qualifier_type_id": "biolink:causal_mechanism_qualifier"
                                "qualifier_value": "agonist"
                            },
                            {
                                "qualifier_type_id": "biolink:catalyst_qualifier"
                                "qualifier_value": "CHEBI:63951"   # estrogen receptor agonist
                             }
                      ]
                 },
                 ... more qualifier sets
            ]
     }  # end of test edge spec
... more test edges
]

Attribute Constraints

Although QNodes can also have attribute constraints, to use cases for validation are perhaps less compelling than edge filtering, so for now, we focus on the TRAPI AttributeConstraint schema for query graph (QEdge):

    QEdge:
      type: object
...
      properties:
...
        attribute_constraints:
          type: array
          description: >-
            A list of attribute constraints applied to a query edge.
            If there are multiple items, they must all be true (equivalent
            to AND)
          items:
            $ref: '#/components/schemas/AttributeConstraint'
          default: []

where AttributeConstraint is:

    AttributeConstraint:
      type: object
      description: >-
        Generic query constraint for a query node or query edge
      properties:
        id:
          allOf:
            - $ref: '#/components/schemas/CURIE'
          description: >-
            CURIE of the concept being constrained. For properties
            defined by the Biolink model this SHOULD be a biolink CURIE.
            otherwise, if possible, from the EDAM ontology. If a suitable
            CURIE does not exist, enter a descriptive phrase here and
            submit the new type for consideration by the appropriate
            authority.
          example: EDAM:data_0844
        name:
          type: string
          description: >-
            Human-readable name or label for the constraint concept.
            If appropriate, it SHOULD be the term name of the CURIE used
            as the 'id'. This is redundant but required for human
            readability.
          example: molecular mass
        not:
          type: boolean
          default: false
        operator:
          type: string
          description: >-
            ... see TRAPI for full definition
        value:
          example: 57.0
          description: >-
            Value of the attribute. May be any data type, including a list.
            If the value is a list and there are multiple items, at least one
            comparison must be true (equivalent to OR) unless the '==='
            operator is used. If 'value' is of data
            type 'object', the keys of the object MAY be treated as a list.
            A 'list' data type paired with the '>' or '<' operators will
            encode extraneous comparisons, but this is permitted as it is in
            SQL and other languages.
        unit_id:
          example: UO:0000222
          description: >-
...
          nullable: true
        unit_name:
          example: kilodalton
          description: >-
...
          nullable: true
      required:
        - name
        - id
        - operator
        - value
      additionalProperties: false

The AttributeConstraint schema has less depth than QualifierConstraints but obviously more degrees of freedom at the bottome in that values are returned relative to a specific operator with relatively free ranging value data (types). Alas, attribute constraint model indicates that name, id, operator and value are all required.

SRI Testing harness validation generates various permutations of TRAPI test queries based on test edge data. As such, it may need to embrace a more modest objective of supporting a small number of high value use cases (one of which the system has already has, in the focused validation of string knowledge source provenance data).

One new use case may be the validation of Association Constraint below).

Other explicit use cases could initially limit validation to a subset of use cases, for example, to implicitly always assume use of the operator value '===' on simple scalar (e.g. string, numeric) data.

This suggests an initial specification of attribute constraints in the test data edge might look something like this:

"attribute_constraints": [
   {
          "id": "<attribute_type_id>"
           "value": "<value>"
   },
...other attribute constraint id/value pairs?
]

Association Constraint

Although TRAPI querying of a biolink:Association constraint may initially need to be specified as an attribute constraint (see above), there is nothing that prevents us here from adding a specific 'association' tag to the test edge data JSON (see below).

Summary

Elaborating our original test data example above, the newly formatted test edge data might finally look something like this:

{
...
    "edges": [
       {
            "subject_category": "biolink:SmallMolecule",  
            "object_category": "biolink:Disease",
            "predicate": "biolink:treats",
            "subject_id": "CHEBI:3002",     # beclomethasone dipropionate
            "object_id": "MESH:D001249"     # asthma
            "association": "biolink:ChemicalToDiseaseOrPhenotypicFeatureAssociation",
            "qualifier_constraints": [
                 {
                      "qualifier_type_id": "biolink:causal_mechanism_qualifier"
                      "qualifier_value": "inhibition"
                 },
                 ...other qualifier constraint type_id/value pairs?
             ],
            "attribute_constraints": [
                 {
                      "id": "<attribute_type_id>"
                      "value": "<value>"
                 },
                 ...other qualifier constraint type_id/value pairs
             ]

        },
        ...other test edges
   ]
}
cbizon commented 1 year ago

I think that the basic structure is correct. You really want to allow for all the qualifiers for that edge to be in the doc here. (in other words the part that says "... other qualifier constraint type_id/value pairs?" needs to exist. But that's not complicated, it just means that there are more entries in that list.

The attribute_constraints are useful only if we want to write tests querying by attribute. We do have that capability in TRAPI so maybe it makes sense. But we maybe should think about what tests we might expect to run there before we have everybody implement that?

cbizon commented 1 year ago

I guess one other thing- why do we have association in here? We (ranking agent) don't model it explicitly.

colleenXu commented 1 year ago

I agree with @cbizon, the "Association" stuff in biolink-model (with children like ChemicalAffectsGeneAssociation) isn't used at all by BTE / Service Provider.

Also, I'm confused because I thought the test-triples were meant to be similar to Edges, not QEdges. So the test-triples would have qualifiers and attributes (not qualifier_constraints and attribute_constraints).

So Service Provider has a link to a "qualifiers" test-triple file right now.

sierra-moxon commented 1 year ago

With the test harness, we want to be able to verify that all the properties of an edge (including the nodes that are connected by it) are valid according to the model. The Association objects in the model help us constrain the properties of an edge, including the subject and object domain/range, the predicate, and the kinds of qualifiers that are expected. It would be really handy to have the association name to lookup and test against for each statement.

An alternative to supplying the association in the test data could be that the testing harness tries to match the incoming test statements against all possible associations in the model, and throws an "invalid statement" error if it can't one. But I am not sure that would be very performant against a large set of test data? And the "invalid statement" error might not be as specific as we could make it if we had the association -- for example, if the association match 80%, we could report which qualifiers or changes would be needed to make it a valid association.

RichardBruskiewich commented 1 year ago

We can start by making the 'association' category optional, assuming only biolink:Association and constraining the testing accordingly (which is, of course, "not much").

RichardBruskiewich commented 1 year ago

Also, I'm confused because I thought the test-triples were meant to be similar to Edges, not QEdges. So the test-triples would have qualifiers and attributes (not qualifier_constraints and attribute_constraints).

In an absolute sense, @colleenXu, you are correct that input test data ought to be considered 'real' edges, not query edges.

There's a messy issue hiding behind the test edges files and that is that TRAPI knowledge graphs cannot have abstract or mixin categories and predicates, but in fact, some KPs (no names will be mentioned) are giving such categories and predicates in their test data, which works fine in the TRAPI Query Graph (it's allowed) but I suspect that is cheating a bit, since the TRAPI Knowledge Graph cannot actually ever return such categories and predicates, so in a manner of speaking, such tests ought not to pass!

In fact, the SRI Test Harness can insist on "strict" validation in which case, use of such abstract or mixin category and predicate values in the input data would trigger an error...

As for the use of 'qualifiers' versus 'qualifier_constraints', in fact the test edges would initially used as query graph values (i.e. in as qualifiers constraints) but yes, one might suppose in the validation itself, they should simply be consider 'qualifiers' for direction comparison in the TRAPI response. There is no harm in simplifying the terminology in this fashion.

RichardBruskiewich commented 1 year ago

I guess one basic comment I'd have on some of the above, is that most of the new extra stuff is 'optional'. Whether or not KPs implement any form of it at a given point in time will be purely pragmatic. In such a situation, they can use the (optional) extensions to queries.

I don't have a strong opinion about attribute constraints, except that our current treatment of them in the SRI Testing harness is fairly minimalist: mainly some tests for knowledge_source annotation consistency (i.e provenance). Not much else. If use of the constructs are optional, we can probably extend testing on a compelling case-by-case basis (much like the qualifiers starting with 'drug-gene' relationships...)

cbizon commented 1 year ago

@sierra-moxon the problem with including associations is that none (?) of the tools actually use them, so that the inference problem you describe would now have to be done by everybody simultaneously to create the test data.

@RichardBruskiewich I think that there's some practical disagreement about whether Knodes can be mixins or abstract entities. It would be nice if we could have a rule, but I can think of a couple of cases where mixin or abstract is the best one can do (I can elaborate if need be).

RichardBruskiewich commented 1 year ago

@sierra-moxon the problem with including associations is that none (?) of the tools actually use them, so that the inference problem you describe would now have to be done by everybody simultaneously to create the test data.

Like @sierra-moxon, I sense that the onset of statement qualifiers have made a significantly stronger case for tagging edges with their biolink:Association (child) category. We know that nobody really uses them yet, but one could say the same about qualifiers!

As I hinted above, we could make the association tagging of test edge data optional to start (or perhaps, indefinitely) but if provided by a few hardy pioneering KP's, we could leverage it a bit to test ideas about how it may facilitate validation.

One can imaging tagging edges in one (or both) of two ways:

  1. statically, in the underlying knowledge graphs or
  2. dynamically, in the fly using relatively simple heuristics

Furthermore, one can imagine that the more specialized KPs may often (although not always) only have a small handful of applicable biolink:Association (child) categories that apply to their knowledge, making the tagging less onerous than it first seems, thus, the dynamic tagging may not be too difficult to implement (rather than marking up all of the underlying knowledge graph, where KPs maintain such graph databases - e.g. Neo4j etc.? - internally)

The biolink:Association category tagging in the Biolink Model has been around quite awhile but as a second class citizen of the model. I'd be curious to hear from @cmungall about his perspective on this... there seems to be a helpful potential role in facilitating constraint, validation and interpretation of knowledge inherent in these edge category tags, even if we are not yet significantly leveraging them.

RichardBruskiewich commented 1 year ago

@RichardBruskiewich I think that there's some practical disagreement about whether Knodes can be mixins or abstract entities. It would be nice if we could have a rule, but I can think of a couple of cases where mixin or abstract is the best one can do (I can elaborate if need be).

This discussion sounds a bit above my pay grade - maybe the Architecture, TRAPI WG and/or Data Modelling meetings can vote on this. SRI Testing can encode whatever policy applies.

With respect to abstract and mixin validation, SRI Testing currently defaults to "QGraph - Yes, KGraph - No" although there is a 'strict_validation' boolean flag which overrides this either way.

Colleen's observation about test edges being KGraph not QGraph relates a bit to the outcome of this decision in that - as you know - the test edges are fed verbatim (in difference combinatorial ways) into the QGraph of TRAPI requests, then (partly) used to assess the TRAPI response (although the main focus there is focused more on TRAPI schema and Biolink Model compliance of request KGraph JSON contents, but not as much, the Result mappings in between QGraph and KGraph - more needs to be done there (some already on the issue tickets "TODO" list)

colleenXu commented 1 year ago

I agree with Chris Bizon's comments here, but to add...

sierra-moxon commented 1 year ago

makes sense: please put in tickets/ PRs if possible to biolink - I am happy to make changes as needed. (for example, I just moved macromolecular complex from a mixin to a regular element in the model as it fit best with a use case that Chris B brought to us). Sometimes it just takes some reworking in the hierarchy.

RichardBruskiewich commented 1 year ago

Generally speaking, abstract and mixin category (class) and predicate (slot) definitions are not meant to be instantiated. Abstract classes/slots are generally found within an 'is_a' hierarchy; mixins, only injected under a 'mixins' list in a given class/slot.

As Sierra indicates, is probably better to make abstract and mixin elements 'concrete' if they are routinely embedded in knowledge graph nodes and edges.

RichardBruskiewich commented 1 year ago

Hello @cbizon, @colleenXu and @sierra-moxon (cc: @putmantime), I am of the impression that we need to get an initial release of the revised KP test edge data format out the door (as early as the Translator Architectural Committee meeting tomorrow, 29 November 2022), thus in the interests of this goal, here are my thoughts:

  1. On the issue of biolink:Association class hierarchy annotations, I think @colleenXu's suggestion that they be permitted in the KP test edge data, but not (yet) required to be published in the TRAPI (hence knowledge sources) is actually a helpful step forward (@sierra-moxon, would you concur) in that the KP curators can already start to signal the implied knowledge constraints to the SRI Testing (a.k.a. Biolink Model Toolkit) in validation while minimizing back end (knowledge graph) impact.
  2. On the issue of abstract and mixin category (class) and predicate (slot) elements, I am wondering to myself that if such elements appear int the input test edge data, perhaps the SRI Testing framework can take note of their presence (via "information" messages) but also, simultaneously keep track of them downstream when the knowledge graph compliance to Biolink Model standards is assessed, not treating them as errors but perhaps, just as a 'warning' (or again, an 'information') message? The only tricky aspect one might foresee here is whether or not the SRI Testing somehow expect and allow concrete children of the given (Q graph) abstract and mixin category (class) and predicate (slot) elements to legitimately appear as answers in the knowledge graph, in lieu of the original Q graph element values? This may require a bit more clever Biolink lookup and validation coding.
  3. On @colleenXu's other observation: "_...Also, I'm confused because I thought the test-triples were meant to be similar to Edges, not QEdges. So the test-triples would have (optional) qualifier and attribute (not qualifier_constraints and attribute_constraints)..._", I think treating the input test edge data as specific instances of qualifier and attribute terms against which validation ought to occur, is not without merit (again, @sierra-moxon, please kindly comment). I am therefore not adverse to renaming these fields in the revised format to those.
  4. On @cbizon commentary that we don't really yet emphasize attribute constraints in most Translator resources, we can kick the can down the road to a later iteration of the SRI Testing, based on feedback from the Translator community. So, I can remove them from this revision of the KP test data specification.

Thus, the revised proposal for tomorrow's meeting would be the follow:

{
... <other necessary test data file fields>...
    "edges": [
       {
            "subject_category": "biolink:SmallMolecule",  
            "object_category": "biolink:Disease",
            "predicate": "biolink:treats",
            "subject_id": "CHEBI:3002",     # beclomethasone dipropionate
            "object_id": "MESH:D001249"     # asthma
            "association": "biolink:ChemicalToDiseaseOrPhenotypicFeatureAssociation",
            "qualifiers": [
                 {
                      "qualifier_type_id": "biolink:causal_mechanism_qualifier"
                      "qualifier_value": "inhibition"
                 },
                 ...other qualifier constraint type_id/value pairs?
             ]
        },
        ...other test edges
   ]
}
colleenXu commented 1 year ago

I'm fine 👍 with this.

Only 1 small note: is that the format we (@sierra-moxon, biolink-model folks) want for "association" - a biolink-prefix and snake_case? I'm good with using any case, particularly if one format is easier to ingest / parse...

Previously, when we've discussed Associations in comments, we've used PascalCase (ChemicalAffectsGeneAssociation) with no biolink-prefix.

sierra-moxon commented 1 year ago

good catch @colleenXu - associations are classes in the model, so it would be good to use the PascalCase.

RichardBruskiewich commented 1 year ago

good catch @colleenXu - associations are classes in the model, so it would be good to use the PascalCase.

Fixed. Thanks @colleenXu...

I guess we're ready to present this tomorrow. I'll make a PR of the SRI Testing README out of this.