implement edge attribute constraints

andrewsu commented 6 months ago

We originally proposed edge attribute constraints in the context of TRAPI 1.3 and https://github.com/biothings/biothings_explorer/issues/482, but breaking this out to its own ticket.

We have a solid use case for edge constraints proposed in this query template for the CQS: https://github.com/TranslatorSRI/CQS/blob/main/templates/mvp1-templates/mvp1-template4-bte-aeolus/mvp1-template5-service-provider-aeolus.json

The key bit is here, attempting to apply a minimum threshold on the biolink:evidence_count from AEOLUS.

    "message": {
        "query_graph": {
            "edges": {
                "e0": {
                    "predicates": [
                        "biolink:applied_to_treat"
                    ],
                    "subject": "n0",
                    "object": "n1",
                    "attribute_constraints": [
                        {
                         "id": "biolink:evidence_count",   
                         "operator": ">",
                         "value": 20            
                        }
                    ]
                }
            },
...

There are (at least) two issues that need to be done/checked:

Filtering of API responses: I'm assuming that it will be easier to do the edge attribute filtering after the subquery, rather than trying to adjust the subquery itself.
Aggregation of multiple values: In this slack message, @colleenXu pointed out a case where evidence_count is provided as a multi-element array (example below). In this case, I think it is reasonable to apply the constraint to the sum of the evidence_counts.

{
  "edges": {
    "1feea171db6394cfd9bcb20deae0ad9a": {
      "predicate": "biolink:applied_to_treat",
      "subject": "PUBCHEM.COMPOUND:3386",
      "object": "MONDO:0002050",
      "attributes": [
        {
          "attribute_type_id": "biolink:evidence_count",
          "value": [
            733,
            42
          ]
        }
      ],
      "sources": [
        {
          "resource_id": "infores:aeolus",
          "resource_role": "primary_knowledge_source"
        },
        {
          "resource_id": "infores:mychem-info",
          "resource_role": "aggregator_knowledge_source",
          "upstream_resource_ids": [
            "infores:aeolus"
          ]
        },
        {
          "resource_id": "infores:service-provider-trapi",
          "resource_role": "aggregator_knowledge_source",
          "upstream_resource_ids": [
            "infores:mychem-info"
          ]
        }
      ]
    }
  }
}

colleenXu commented 6 months ago

From my perspective, there's 3 issues at play here:

1: edge-attribute value type

BTE is currently returning this edge-attribute in Dev/CI instances (ref: commit).

However, the value type is currently an array of ints (click for examples).

These are from this example BTE response [show-edge-attribute-issue.json](https://github.com/biothings/biothings_explorer/files/14644742/show-edge-attribute-issue.json), which runs the POST version of [this query](https://mychem.info/v1/query?q=aeolus.unii:01K63SUP8D&jmespath=aeolus.indications%7C[?count%3E`20`]&fields=aeolus.indications,aeolus.unii) to MyChem Example of a 1-element array: ``` "dd9daae5b03bcad0698ff6669090f36b": { "predicate": "biolink:applied_to_treat", "subject": "PUBCHEM.COMPOUND:3386", "object": "MEDDRA:10070592", "attributes": [ { "attribute_type_id": "biolink:evidence_count", "value": [ 875 ] } ], ``` Example of a multi-element array: the [733 count from Depression and 42 count from "Depressed mood"](https://mychem.info/v1/query?q=aeolus.unii:01K63SUP8D&jmespath=aeolus.indications%7C[?count%3E`20`]&fields=aeolus.indications,aeolus.unii) were put in the same edge/edge-attribute since both meddra IDs mapped to the "MONDO:0002050 (depressive disorder)" entity. ``` "1feea171db6394cfd9bcb20deae0ad9a": { "predicate": "biolink:applied_to_treat", "subject": "PUBCHEM.COMPOUND:3386", "object": "MONDO:0002050", "attributes": [ { "attribute_type_id": "biolink:evidence_count", "value": [ 733, 42 ] } ], ```

I suggest flattening these into ints, because the array will probably cause validation issues (biolink-model says attribute values should be int) and it'll make the edge-attribute constraint easier to implement.

But we'll need to decide what to do with the multi-element arrays. These are happening because MyChem has separate meddra indication IDs, but BTE/NodeNorm maps them to the same entity. BTE then merges those records into the same edge, and concatenates the counts in the edge-attribute value. I think we could either:

add the counts from separate records together
create separate edges for different counts (add to hash?)

(Note: I'm not sure about flattening all 1-element arrays in edge-attributes. biolink:publications may be one example where we always want it to be an array, but we'd need to check with TRAPI folks first...)

2: what to do with the previous effort - a default, hard-coded count limit

EDIT: SEE UPDATE BELOW - we've implemented this.

We've been trying to add a hard-coded count limit of 20 to our MyChem queries https://github.com/biothings/biothings_explorer/issues/727#issuecomment-1782434986, similar to what we do with SEMMEDDB.

I was able to add it to the aeolusTreats operation (chem -> disease, commit), which all instances are using.

Old notes on reverse operation

But this hasn't been done for the reverse operation `aeolusTreats-rev` (disease -> chem), which is what creative-mode uses. In discussions last week ([three](https://suwulab.slack.com/archives/CC218TEKC/p1710450435216889) [Slack](https://suwulab.slack.com/archives/CC218TEKC/p1710451190238089) [links](https://suwulab.slack.com/archives/CC218TEKC/p1710451427180089)), we finally reached consensus on next steps: * by adjusting the x-bte annotation, I can get partway there. See this [commit](https://github.com/NCATS-Tangerine/translator-api-registry/commit/c09eeebf9c2f044e25ea16b2e228dffbb742e880) (special-reverses branch) * next is writing/implementing the BTE JQ-post-processing to remove the hits when the `aeolus.indications` field is empty. While this should be quick, I'm unsure of the logic to use and would need to discuss with Jackson... * something super-specific, that only works on responses from this operation? * generic-ish: "remove hits if this is a BioThings API AND supportBatch is false AND the scopes field specified in the request body (`aeolus.indications.meddra` in this case) isn't in the hit". * the "BioThings API only" and "supportBatch is false" should match "special reverses" - which are the only cases where we'd need this logic * I don't have other current x-bte annotation examples where this would be useful. I suspect that it may be useful for writing reverse x-bte operations for [MyChem chembl drug-mechanism](https://github.com/biothings/biothings_explorer/issues/779) and [drugcentral bioactivity](https://github.com/biothings/biothings_explorer/issues/780)

But if we want to implement TRAPI-query edge-attribute constraints, it's not clear if we want to go forward with this. An edge-attribute constraint < 20 would conflict with this hard-coded limit.

3: how to implement this issue's ask: TRAPI-query edge-attribute constraints

This is still up for discussion:

how generic/general we want our approach to be
how quickly we can do this
do we still want a default, hard-coded count limit for these MyChem aeolus indication operations (ex: when an edge-attribute constraint isn't specified)?

Idea: if an edge-attribute constraint is specified...

after running all sub-queries/building records, filter the records to only those that have the edge-attributes + the values meet the criteria (need to double-check that this is the TRAPI spec).
- pros: this seems to be the easiest to do (conceptually easy, less chance of bugs)
- cons: wasted effort getting records we'll later throw out
for BioThings APIs, transform the constraint into part of the query

I had another idea of transforming the constraint into part of the BioThings API query using the x-bte annotation templating and info in the response-mapping, but this would be complex and more effort to think through and implement.

colleenXu commented 6 months ago

During today's group meeting, we made decisions on issues (1) and (3) above:

3: how to implement TRAPI-query edge-attribute constraints

agreed to do this after retrieving the sub-query response (vs transforming the constraint into part of the sub-query API call)
two ways to do this: we picked "at/after edge-merging" because we want to keep all the counts that are getting "merged" and add them together
- VS applying the constraint at record creation time, which would throw out individual records where the count < limit. The benefit is having less records to work with in future steps

1: how to flatten multi-element arrays

agreed to add counts from multi-element arrays together to create int, without doing any steps to remove/throw out counts beforehand. This makes the most conceptual sense.
- use a config list of attribute-type-ids to control when we do this. For now, that list will just include this attribute-type-id (biolink:evidence_count)
then apply constraint to those sums

Jackson @tokebe estimated that this would take ~2 days of work. But as part of this effort, they'll review the TRAPI yaml/spec docs for QEdge.attribute_constraints expectations and requirements - and they'll decide how full/robust of an implementation to do for this issue.

colleenXu commented 6 months ago

Note that I've made a PR to ask about this template https://github.com/TranslatorSRI/CQS/pull/9:

tool queried (BTE vs Service-Provider-TRAPI)
attribute-constraint's value type - right now it's a string which is confusing
the attribute-type-id

colleenXu commented 6 months ago

Update! The template PR https://github.com/TranslatorSRI/CQS/pull/9 has been merged. So the current template is https://github.com/TranslatorSRI/CQS/blob/main/templates/mvp1-templates/mvp1-template4-bte-aeolus/mvp1-template5-service-provider-aeolus.json

Changes:

attribute-constraint value type is now a int
using Service-Provider-TRAPI, rather than BTE, then using a separate scoring service. Going to try this out.

colleenXu commented 5 months ago

Update on issue 2 above

The hard-coded/default/MyChem-query-level limit is now live in Dev/CI! See the details in https://github.com/biothings/biothings_explorer/issues/727#issuecomment-2046058611

colleenXu commented 3 months ago

And noting that Issue 1 was also addressed last month as part of https://github.com/biothings/biothings_explorer/issues/727#issuecomment-2091478525

Leaving just Issue 3 - the edge attribute constraint implementation itself (first post, later decision)

andrewsu commented 1 day ago

Going to add this to our agenda for next week to discuss. We tabled this earlier this year, and I'd like to revisit where we stand w.r.t. current priorities.

For updated info/context, BioLink Model undertook the "treats refactor", which separates out different types of evidence into different predicates (e.g. "in_clinical_trials_for", "beneficial_in_models_for"). CQS then created one hop templates that capture how to query translator with each of these predicates, and these queries often include edge constraints. (Though we could review those templates in more detail to look at what those edge constraints actually do and how common they are.)

Right now, our first MVP1 template queries the root of the treats predicate hierarchy (biolink:treats_or_applied_or_studied_to_treat). We should break each treats predicate out to different templates (because they have very different levels of confidence; see new issue #881), and then add in the CQS edge constraints once that functionality is available.

biothings / biothings_explorer