biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0
10 stars 11 forks source link

create tests for non-canonical query topolgies #214

Open andrewsu opened 3 years ago

andrewsu commented 3 years ago

We should check whether BTE is correctly normalizing alternate representations of one query. For example I would expect that ChemicalSubstance[id:XXXX] - treats - Disease would return the same results as Disease - treated_by -> ChemicalSubstance[id:XXXX].

We should also test longer query paths, ie A[id:XXX] -> B -> C should be the same as A[id:XXXX] -> B <- C.

Assuming these BTE is using the expected logic to equate these various queries, we should create formal tests for these cases.

colleenXu commented 3 years ago

TDLR:


Comparing:

  1. "classic Predict": KCNMA1 (HGNC:6284) -(gene_associated_with_condition)-> Disease, then Disease -(condition_associated_with_gene)-> Gene (structure: A -> B -> C)
  2. "modified Predict for canonical predicate direction): KCNMA1-(gene_associated_with_condition)-> Disease, then Disease <-(gene_associated_with_condition)- Gene (structure: A -> B <- C).

Expected behavior:

These two scenarios should return the same knowledge_graph nodes / go through the same sub-queries. They would return different edges/results because the edges for the second hop (between B and C) would have different directions/predicates.

Current behavior:

Scenario 2 doesn't work as planned. It looks like this is because of the predicate restriction in the second step.

BTE does the querying correctly:

First: it queries from KCNMA1 -> Disease (set of IDs), then does predicate restriction to only keep nodes/edges/results with the predicate "gene_associated_with_condition". The logs for the predicate restriction are below:

        {
            "timestamp": "2021-06-24T16:12:18.959Z",
            "level": "DEBUG",
            "message": "query_graph_handler: Total number of results returned for this query is 344.",
            "code": null
        },
        {
            "timestamp": "2021-06-24T16:12:18.959Z",
            "level": "DEBUG",
            "message": "query_graph_handler: Successfully applied post-query predicate restriction with 42 results.",
            "code": null
        },

Second: it correctly flips the query edge from Gene <- filtered set of Disease IDs to filtered set of Disease IDs -> Gene. It then does that set of queries.

However, it looks like it then tries to restrict the answer edges (Disease -> Gene) with the predicate "gene_associated_with_condition" which none of them have. Instead some have the flipped/inverse predicate condition_associated_with_gene (like Query 1).

See the logs below:

        {
            "timestamp": "2021-06-24T16:13:41.941Z",
            "level": "DEBUG",
            "message": "query_graph_handler: Total number of results returned for this query is 73000.",
            "code": null
        },
        {
            "timestamp": "2021-06-24T16:13:41.941Z",
            "level": "DEBUG",
            "message": "query_graph_handler: Successfully applied post-query predicate restriction with 0 results.",
            "code": null
        }

So when BTE correctly flips an edge for querying, some fix is needed so the predicate restriction is then done correctly.

I could think of two ways to fix this:

  1. the predicate restriction uses the inverse predicate for filtering/restriction
    • [EDIT: andrew says we should be able to handle this even with the latest biolink. I think this involves checking, possible changes to the biolink JS package]
      • potential problem: the latest version of biolink model only lists inverses on 1 predicate of the pair. In that situation, I'm then not sure if BTE can recognize that gene_associated_with_condition is the inverse of condition_associated_with_gene AND vice versa.
    • then after this....BTE needs to flip/invert the predicates on these edges to match the query graph. I can't tell if it's doing it correctly right now
  2. the answer edges all need to flip predicates first, then predicate restriction using the query edge's predicate can be done.
    • potential problem: the problem with (2) is that this may involve changed logic for what order things are done in...

Query 2: the "modified" predict that doesn't work

{
    "message": {
        "query_graph": {
            "nodes": {
                 "n0": {
                    "ids": ["HGNC:6284"],
                    "categories":["biolink:Gene"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                },
                "n2": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:gene_associated_with_condition"]
                },
                "e1": {
                    "subject": "n2",
                    "object": "n1",
                    "predicates": ["biolink:gene_associated_with_condition"]
                }
            }
        }
    }
}

Query 1: "classic" predict that works

{
    "message": {
        "query_graph": {
            "nodes": {
                 "n0": {
                    "ids": ["HGNC:6284"],
                    "categories":["biolink:Gene"]
                },
                "n1": {
                    "categories": ["biolink:Disease"]
                },
                "n2": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": ["biolink:gene_associated_with_condition"]
                },
                "e1": {
                    "subject": "n1",
                    "object": "n2",
                    "predicates": ["biolink:condition_associated_with_gene"]
                }
            }
        }
    }
}
colleenXu commented 3 years ago

For @andrewsu 's note above:

ChemicalSubstance[id:XXXX] - treats - Disease would return the same results as Disease - treated_by -> ChemicalSubstance[id:XXXX].

1 hops seem to work fine right now. My tests were here. I can add another test for ChemicalSubstance treats Disease and vice versa if you would like.

See my review here

colleenXu commented 3 years ago

@andrewsu I've made a doc here for 1 and 2 hop Predicts and Explain. I've tested all of the queries with the latest query handling code and I think it works as-expected.

https://github.com/biothings/biothings_explorer/blob/query-directions/docs/tests-query-directions.md

I think the next step is deciding how to turn these into tests that aren't brittle (dependent on specific APIs / configurations...)

colleenXu commented 2 years ago

rebased the branch to match latest main, moved and updated the file: https://github.com/biothings/BioThings_Explorer_TRAPI/commits/query-directions

next time try merging to main rather than rebasing...

colleenXu commented 2 years ago

FYI to @tokebe. This is the issue that involves testing QEdge directionality.

Although it also involves testing more than that maybe....

andrewsu commented 1 year ago

To summarize current status here, Colleen created a "query-directions" branch that has a markdown file with various queries that are slightly different in terms of directionality. Behavior was as-expected when this doc was created ~2 years ago -- the different variants produced the same results. We could create integration tests around these queries to test, for example, that the node, edge, and result counts are the same between the variants.

Going to leave this issue open as a nice-to-have...