biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0
10 stars 11 forks source link

KG keeping nodes and edges that aren't in results #409

Closed colleenXu closed 2 years ago

colleenXu commented 2 years ago

@tokebe noticed that the message.knowledge_graph section of a TRAPI response contains nodes and edges that aren't in the results section.

Jackson has shown this issue with the analyses https://github.com/biothings/BioThings_Explorer_TRAPI/issues/401#issuecomment-1029298778 and in the reports for automated runs https://dev.api.bte.ncats.io/demotests/ (an example is by going to summary -> A.2_RHOBTB2_twohop.json -> response -> resultsSanityCheck -> KGEdgesMissingFromResults and KGNodesMissingFromResults).

Based on a quick look, I think these nodes/edges are parts of incomplete paths - so it's correct that they do not appear in the results. However, I think we as a team agreed to remove nodes/edges of incomplete paths (they can't fill out a sub-graph that matches the QGraph)...and I thought this was fixed around the time that the new edge-management code was released.

The next steps are to verify that my hunch above is correct...

ariutta commented 2 years ago

My understanding is that the records provided to QueryResults.update are not necessarily all part of complete paths. They should all match up at specific nodes, but some of the records can still be dead ends. The QueryResults.update code stitches together the complete paths, but I don't know what happens to the KG for the dead ends.

colleenXu commented 2 years ago

It might help to take the queries with this issue and...

tokebe commented 2 years ago

Just adding some documentation to use demotests for testing solutions to this issue prior to getting a module-level test written:

colleenXu commented 2 years ago

Next step is likely to update "knowledge graph" after results-assembly, since results-assembly does a good job of pruning "incomplete paths" or nodes/edges that aren't bound to qNodes/qEdges.

From Slack (Marco):

I took a quick look at what happens after query execution and looks like query_graph and results components of the TRAPI response object get updated independently after query execution, so perhaps that's why the sanity check on the dev instance sometimes fails when the graph has entities the results don't. I had a feeling it could be something like that. queryResults seems to prune out results with incomplete paths while the query graph code is much simpler and just adds whatever you give it, it doesn't check for incomplete paths. Perhaps there's a way to update that based on the result of queryResults? This makes sense for multi hop queries why this could be happening but there was another simple query that also showed extra stuff in the query_graph not in the results. But the code from the bteGraph.update (which gets you query_graph) seems to be pretty straight forward. Just putting it out there for discussion