biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

Handle explain type of query #112

Closed kevinxin90 closed 3 years ago

kevinxin90 commented 3 years ago
{
    "message": {
        "query_graph": {
            "nodes": {
                "a": {
                    "category": "biolink:Disease",
                    "id": "MESH:D015464"
                },
                "b": {
                    "category": "biolink:ChemicalSubstance",
                    "id": "CHEBI:45783"
                },
                "c": {
                    "category": "biolink:Gene"
                }
            },
            "edges": {
                "ac": {
                    "subject": "a",
                    "object": "c"
                },
                "bc": {
                    "subject": "c",
                    "object": "b"
                }
            }
        }
    },
    "knowledge_graph": {
        "nodes": [],
        "edges": []
    },
    "results": []
}
andrewsu commented 3 years ago

Three great examples of "explain" queries from Sui

[image: three example "explain" queries]

colleenXu commented 3 years ago

This is not working as expected.


For example, we would expect KCNMA1 -(e0)-> biolink:NamedThing <-(e1)- TAAR1 to do the following:

  1. one-hop KCNMA1 -(e0)-> biolink:NamedThing
  2. one-hop TAAR1 -(e1)-> biolink:NamedThing
  3. either (@andrewsu, we have to pick the desired behavior):
     A. filter so we only keep the answers that came from BOTH one-hops (remove the rest from nodes/edges/results)
     B. ~OR do nothing. Return the output of both one-hops as the "answers"~ (Edited AS 2021-06-11: strike out this option)
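
Option A amounts to a set intersection on the shared node n1. Here is a minimal Python sketch, assuming each one-hop has already returned a flat list of edges with `subject` and `object` keys. The function name, the record shape and the `EXAMPLE:` identifiers are hypothetical and are not BTE's internal API; only the two HGNC IDs come from the query.

```python
def intersect_one_hops(hop0_edges, hop1_edges):
    """Keep only edges whose object node was reached by BOTH one-hops."""
    # Nodes reached from each fixed input node
    reached0 = {edge["object"] for edge in hop0_edges}
    reached1 = {edge["object"] for edge in hop1_edges}
    shared = reached0 & reached1
    kept0 = [edge for edge in hop0_edges if edge["object"] in shared]
    kept1 = [edge for edge in hop1_edges if edge["object"] in shared]
    return shared, kept0, kept1


# Hypothetical one-hop outputs (EXAMPLE:* IDs are placeholders)
hop0 = [
    {"subject": "HGNC:6284", "object": "EXAMPLE:A"},
    {"subject": "HGNC:6284", "object": "EXAMPLE:B"},
]
hop1 = [
    {"subject": "HGNC:17734", "object": "EXAMPLE:B"},
    {"subject": "HGNC:17734", "object": "EXAMPLE:C"},
]

shared, kept0, kept1 = intersect_one_hops(hop0, hop1)
print(shared)  # {'EXAMPLE:B'}
```

Each one-hop can run on its own; only the final intersection step needs both outputs.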

Instead, something seems to be going on that makes the query (TRAPI, listed in the next comment) take a long time (like >30 minutes). Andrew tried separately running the one-hops (1 and 2) above (also listed in the next comment), and both were quick (<6 seconds each).


I think these logs from my console are relevant. I used a JSON viewer to help me read the path parts. This is my interpretation of the logs:

LOGS:

 biothings-explorer-trapi:query_graph ALL PATHS {"0":[{"qEdge":{"id":"e0","subject":{"id":"n0","category":["biolink:Gene"],"curie":["HGNC:6284"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},{"qEdge":{"id":"e1","subject":{"id":"n2","category":["biolink:Gene"],"curie":["HGNC:17734"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}}],"1":[{"qEdge":{"id":"e0","subject":{"id":"n0","category":["biolink:Gene"],"curie":["HGNC:6284"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":true,"prev_edge":{"qEdge":{"id":"e1","subject":{"id":"n2","category":["biolink:Gene"],"curie":["HGNC:17734"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},{"qEdge":{"id":"e1","subject":{"id":"n2","category":["biolink:Gene"],"curie":["HGNC:17734"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":true,"prev_edge":{"qEdge":{"id":"e0","subject":{"id":"n0","category":["biolink:Gene"],"curie":["HGNC:6284"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}}]} +0ms

  biothings-explorer-trapi:main query paths constructed: {"0":[{"qEdge":{"id":"e0","subject":{"id":"n0","category":["biolink:Gene"],"curie":["HGNC:6284"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},{"qEdge":{"id":"e1","subject":{"id":"n2","category":["biolink:Gene"],"curie":["HGNC:17734"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}}],"1":[{"qEdge":{"id":"e0","subject":{"id":"n0","category":["biolink:Gene"],"curie":["HGNC:6284"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":true,"prev_edge":{"qEdge":{"id":"e1","subject":{"id":"n2","category":["biolink:Gene"],"curie":["HGNC:17734"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},{"qEdge":{"id":"e1","subject":{"id":"n2","category":["biolink:Gene"],"curie":["HGNC:17734"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":true,"prev_edge":{"qEdge":{"id":"e0","subject":{"id":"n0","category":["biolink:Gene"],"curie":["HGNC:6284"]},"object":{"id":"n1","category":["biolink:NamedThing"]}},"reverse":false,"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}},"input_equivalent_identifiers":{},"output_equivalent_identifiers":{}}]} +1ms

  biothings-explorer-trapi:main Query depth is 2 +1ms
colleenXu commented 3 years ago

TRAPI query for KCNMA1 -(e0)-> biolink:NamedThing <-(e1)- TAAR1

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["HGNC:6284"],
                    "categories": ["biolink:Gene"]
                },
                "n1": {
                    "categories":["biolink:NamedThing"]
                },
                "n2": {
                    "ids":["HGNC:17734"],
                    "categories":["biolink:Gene"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e1": {
                    "subject": "n2",
                    "object": "n1"
                }
            }
        }
    }
}

Fast One Hop 1: KCNMA1-(e0)-> biolink:NamedThing

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["HGNC:6284"],
                    "categories": ["biolink:Gene"]
                },
                "n1": {
                    "categories":["biolink:NamedThing"]
                }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

Fast One Hop 2: TAAR1-(e1)-> biolink:NamedThing

{
    "message": {
        "query_graph": {
            "nodes": {
                "n1": {
                    "categories":["biolink:NamedThing"]
                },
                "n2": {
                    "ids":["HGNC:17734"],
                    "categories":["biolink:Gene"]
               }
            },
            "edges": {
                "e1": {
                    "subject": "n2",
                    "object": "n1"
                }
            }
        }
    }
}
colleenXu commented 3 years ago

This is a special kind of Explain-query we also want to support (see TRAPI query below): ChemicalSubstance celecoxib (PUBCHEM.COMPOUND:2662) -> PTGS1 (HGNC:9604). It's from a Translator standup meeting.

The minimal expected behavior is:

  1. BTE queries ChemicalSubstance PUBCHEM.COMPOUND:2662 -> Gene.
  2. BTE then "filters": only the answer node PTGS1 (HGNC:9604) and the edges/results with that answer node are kept. The other message knowledge_graph.nodes/knowledge_graph.edges/results are removed. A logs entry should mention that this happened.
  3. If there are no edges/results after the filtering is done, that's fine. Keep an empty object for edges and an empty array for results. There will still be message.query_graph, message.knowledge_graph.nodes, and logs.
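
The filtering step could look roughly like the Python sketch below. It assumes a simplified TRAPI-like response in which knowledge_graph.nodes and knowledge_graph.edges are objects keyed by ID and each result carries node_bindings and edge_bindings. `filter_to_target`, the log entry shape, the edge IDs and `EXAMPLE:OTHER_GENE` are hypothetical; this is not BTE's actual code.

```python
def filter_to_target(response, answer_qnode, target_id):
    """Drop every result whose binding for answer_qnode is not target_id."""
    message = response["message"]
    kg = message["knowledge_graph"]
    results = message["results"]

    kept = [
        result for result in results
        if any(b["id"] == target_id for b in result["node_bindings"][answer_qnode])
    ]
    # IDs still referenced by the surviving results
    node_ids = {b["id"] for r in kept for bs in r["node_bindings"].values() for b in bs}
    edge_ids = {b["id"] for r in kept for bs in r["edge_bindings"].values() for b in bs}

    kg["nodes"] = {k: v for k, v in kg["nodes"].items() if k in node_ids}
    kg["edges"] = {k: v for k, v in kg["edges"].items() if k in edge_ids}
    message["results"] = kept  # stays an empty list if nothing matches

    # Record that filtering happened (log shape is an assumption)
    response.setdefault("logs", []).append({
        "level": "INFO",
        "message": f"Filtered {len(results) - len(kept)} result(s) not bound to {target_id}",
    })
    return response


response = {
    "message": {
        "query_graph": {},  # omitted for brevity
        "knowledge_graph": {
            "nodes": {
                "PUBCHEM.COMPOUND:2662": {},
                "HGNC:9604": {},
                "EXAMPLE:OTHER_GENE": {},
            },
            "edges": {
                "edge-a": {"subject": "PUBCHEM.COMPOUND:2662", "object": "HGNC:9604"},
                "edge-b": {"subject": "PUBCHEM.COMPOUND:2662", "object": "EXAMPLE:OTHER_GENE"},
            },
        },
        "results": [
            {
                "node_bindings": {"n0": [{"id": "PUBCHEM.COMPOUND:2662"}], "n1": [{"id": "HGNC:9604"}]},
                "edge_bindings": {"e0": [{"id": "edge-a"}]},
            },
            {
                "node_bindings": {"n0": [{"id": "PUBCHEM.COMPOUND:2662"}], "n1": [{"id": "EXAMPLE:OTHER_GENE"}]},
                "edge_bindings": {"e0": [{"id": "edge-b"}]},
            },
        ],
    }
}

filtered = filter_to_target(response, "n1", "HGNC:9604")
print(sorted(filtered["message"]["knowledge_graph"]["edges"]))  # ['edge-a']
```

In this sketch, knowledge-graph nodes are kept only when a surviving result still references them, so a query with no matching answer ends up with empty nodes, edges and results. Whether the pinned query nodes should be kept in that case is a separate decision.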

Currently, BTE is only doing 1 and ID-resolving the gene ID in the query.


Expected edges in the answer: For the example query, I would expect only the following edges to exist in the Response:


The TRAPI query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:ChemicalSubstance"]
                },
                "n1": {
                    "categories":["biolink:Gene"],
                       "ids":["HGNC:9604"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}
colleenXu commented 3 years ago

This functionality is high priority since it's come up in standup queries and the demo (Workflow D, maybe Workflow C).

andrewsu commented 3 years ago

Note that @ericz1803 found this repo https://github.com/kevinxin90/explain.js from Kevin that handles the special case of explain queries with one intermediate node (used at https://biothings.io/explorer/explain). It is based on @biothings-explorer/call-apis and @biothings-explorer/smartapi-kg, so it may be useful to consult when implementing explain queries in the main application. In fact, the short-term solution to this ticket could be to integrate this code into the main app, leaving the longer-term generalized query handler to handle longer paths and more complex query topologies.

andrewsu commented 3 years ago

one-hop explain query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:ChemicalSubstance"]
                },
                "n1": {
                    "categories":["biolink:Gene"],
                       "ids":["HGNC:9604"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}

two-hop explain query (version 1):

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:ChemicalSubstance"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               },
                "n2": {
                    "categories":["biolink:Gene"],
                       "ids":["HGNC:9604"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e1": {
                    "subject": "n1",
                    "object": "n2"
                }
            }
        }
    }
}

two-hop explain query (version 2):

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:ChemicalSubstance"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               },
                "n2": {
                    "categories":["biolink:Gene"],
                       "ids":["HGNC:9604"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e1": {
                    "subject": "n2",
                    "object": "n1"
                }
            }
        }
    }
}
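
Versions 1 and 2 differ only in the direction of e1: both pin n0 and n2 and leave n1 open. The Python sketch below shows how a handler might recognise that shape. It assumes an "explain" query is one where at least two query nodes carry `ids`, which is a working definition inferred from the examples in this thread and not taken from the TRAPI spec. The helper names are hypothetical.

```python
def pinned_nodes(query_graph):
    """Return the query-node keys that carry at least one ID."""
    return sorted(key for key, node in query_graph["nodes"].items() if node.get("ids"))


def undirected_edges(query_graph):
    """Edges as unordered node pairs, ignoring subject/object direction."""
    return {
        frozenset((edge["subject"], edge["object"]))
        for edge in query_graph["edges"].values()
    }


def is_explain(query_graph):
    """Assumed definition: two or more query nodes are pinned to IDs."""
    return len(pinned_nodes(query_graph)) >= 2


nodes = {
    "n0": {"ids": ["PUBCHEM.COMPOUND:2662"], "categories": ["biolink:ChemicalSubstance"]},
    "n1": {"categories": ["biolink:Disease"]},
    "n2": {"ids": ["HGNC:9604"], "categories": ["biolink:Gene"]},
}
version1 = {"nodes": nodes, "edges": {
    "e0": {"subject": "n0", "object": "n1"},
    "e1": {"subject": "n1", "object": "n2"},
}}
version2 = {"nodes": nodes, "edges": {
    "e0": {"subject": "n0", "object": "n1"},
    "e1": {"subject": "n2", "object": "n1"},
}}

print(pinned_nodes(version1))                                    # ['n0', 'n2']
print(is_explain(version1), is_explain(version2))                # True True
print(undirected_edges(version1) == undirected_edges(version2))  # True
```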
marcodarko commented 3 years ago

https://github.com/biothings/BioThings_Explorer_TRAPI/issues/112#issuecomment-865448828

csgene.txt

These are the results I'm getting using the new query handler algorithm; I just want to make sure they look OK. I'm going through some of the queries here as I read the thread.

marcodarko commented 3 years ago
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:ChemicalSubstance"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               },
                "n2": {
                    "categories":["biolink:Gene"],
                       "ids":["HGNC:9604"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e1": {
                    "subject": "n2",
                    "object": "n1"
                }
            }
        }
    }
}

This is the new result for the two-hop query above; hopefully the new logs will explain the process. twohop.txt

colleenXu commented 3 years ago

For Workflow D:

Note:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Disease"],
                    "ids": ["MESH:D015464"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}
colleenXu commented 3 years ago

some queries with 2 intermediates:

Note: This test query should have this path as a result: ChemicalSubstance PUBCHEM.COMPOUND:2662 <-> Disease MONDO:0002974 <-> Pathway REACT:R-HSA-109704 <-> HGNC:17947.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               },
               "n2": {
                    "categories":["biolink:Pathway"]
               },
                "n3": {
                    "categories":["biolink:Gene"],
                    "ids":["HGNC:17947"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e1": {
                    "subject": "n1",
                    "object": "n2"
                },
                "e2": {
                    "subject": "n2",
                    "object": "n3"
                }
            }
        }
    }
}

This path should exist: Pathway REACT:R-HSA-1368082 <-> Gene NCBIGene:1374 <-> ChemicalSubstance CHEBI:35553 <-> Disease MONDO:0009287

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": ["biolink:Pathway"],
                    "ids": ["REACT:R-HSA-1368082"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                },
                "n2": {
                    "categories": ["biolink:ChemicalSubstance"]
                },
                "n3": {
                    "categories": ["biolink:Disease"],
                    "ids": ["MONDO:0009287"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e02": {
                    "subject": "n1",
                    "object": "n2"
                },
                "e03": {
                    "subject": "n3",
                    "object": "n2"
                }
            }
        }
    }
}
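
One way to check that an expected path is present is to compare each result's node bindings with the expected IDs. This Python sketch assumes results in the TRAPI shape, where each query-node key maps to a list of `{"id": ...}` objects. `has_path` and the sample results list are hypothetical, and a real check may need to allow for equivalent identifiers returned by ID resolution.

```python
def has_path(results, expected):
    """True if some result binds every query node to the expected CURIE."""
    for result in results:
        bindings = result["node_bindings"]
        if all(
            any(b["id"] == curie for b in bindings.get(qnode, []))
            for qnode, curie in expected.items()
        ):
            return True
    return False


expected = {
    "n0": "REACT:R-HSA-1368082",
    "n1": "NCBIGene:1374",
    "n2": "CHEBI:35553",
    "n3": "MONDO:0009287",
}

# Hand-written stand-in for a response's results array
sample_results = [
    {"node_bindings": {
        "n0": [{"id": "REACT:R-HSA-1368082"}],
        "n1": [{"id": "EXAMPLE:OTHER_GENE"}],  # placeholder ID
        "n2": [{"id": "CHEBI:35553"}],
        "n3": [{"id": "MONDO:0009287"}],
    }},
    {"node_bindings": {
        "n0": [{"id": "REACT:R-HSA-1368082"}],
        "n1": [{"id": "NCBIGene:1374"}],
        "n2": [{"id": "CHEBI:35553"}],
        "n3": [{"id": "MONDO:0009287"}],
    }},
]

print(has_path(sample_results, expected))  # True
```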
colleenXu commented 3 years ago

What the results object should look like:

{
  "node_bindings": {
    "n0": [{"id": "CHEBI:41423"}],
    "n1": [{"id": "MONDO:0004247"}],
    "n2": [{"id": "NCBIGene:5742"}]
  },
  "edge_bindings": {
    "e0": [{"id": "CHEBI:41423-biolink:related_to-MONDO:0004247"}],
    "e1": [{"id": "NCBIGene:5742-biolink:related_to-MONDO:0004247"}]
  }
}

For this query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids":["PUBCHEM.COMPOUND:2662"],
                    "categories":["biolink:ChemicalSubstance"]
                },
                "n1": {
                    "categories":["biolink:Disease"]
               },
                "n2": {
                    "categories":["biolink:Gene"],
                       "ids":["HGNC:9604"]
               }
            },
            "edges": {
                "e0": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e1": {
                    "subject": "n2",
                    "object": "n1"
                }
            }
        }
    }
}
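
The result object in this comment binds each query node and edge key to a list of IDs. The Python sketch below assembles one such result from a resolved path. It assumes TRAPI-style bindings (each query key maps to a list of objects with an `id`) and assumes edge IDs follow the subject-predicate-object pattern seen in the example. `build_result` is a hypothetical helper, not part of BTE.

```python
def build_result(node_ids, edges):
    """Build one result from {qnode: curie} and {qedge: (subject, predicate, object)}."""
    return {
        "node_bindings": {
            qnode: [{"id": curie}] for qnode, curie in node_ids.items()
        },
        "edge_bindings": {
            # Edge ID pattern is assumed from the example in this thread
            qedge: [{"id": f"{subj}-{pred}-{obj}"}]
            for qedge, (subj, pred, obj) in edges.items()
        },
    }


result = build_result(
    {"n0": "CHEBI:41423", "n1": "MONDO:0004247", "n2": "NCBIGene:5742"},
    {
        "e0": ("CHEBI:41423", "biolink:related_to", "MONDO:0004247"),
        "e1": ("NCBIGene:5742", "biolink:related_to", "MONDO:0004247"),
    },
)
print(result["edge_bindings"]["e1"])
# [{'id': 'NCBIGene:5742-biolink:related_to-MONDO:0004247'}]
```

Note that e1 runs from n2 to n1 in the query, so its edge ID starts with the gene and ends with the disease.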
colleenXu commented 3 years ago

Note: Kevin's opening query, reformatted, now has results that look as expected. The query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["MESH:D015464"],
                    "categories": ["biolink:Disease"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                },
                "n2": {
                    "ids": ["CHEBI:45783"],
                    "categories": ["biolink:SmallMolecule"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e02": {
                    "subject": "n1",
                    "object": "n2"
                }
            }
        }
    }
}
colleenXu commented 3 years ago

The new query handler handles these cases; I checked this while testing the code.