KG keeping nodes and edges that aren't in results

colleenXu commented 2 years ago

@tokebe noticed that the message.knowledge_graph section of a TRAPI response contains nodes and edges that aren't in the results section.

Jackson has shown this issue with the analyses https://github.com/biothings/BioThings_Explorer_TRAPI/issues/401#issuecomment-1029298778 and in the reports for automated runs https://dev.api.bte.ncats.io/demotests/ (an example is by going to summary -> A.2_RHOBTB2_twohop.json -> response -> resultsSanityCheck -> KGEdgesMissingFromResults and KGNodesMissingFromResults).

Based on a quick look, I think these nodes/edges are parts of incomplete paths - so it's correct that they do not appear in the results. However, I think we as a team agreed to remove nodes/edges of incomplete paths (they can't fill out a sub-graph that matches the QGraph)...and I thought this was fixed around the time that the new edge-management code was released.

The next steps are to verify that my hunch above is correct...

do we need to review these analyses further to figure out what is happening?
@marcodarko is the edge-manager code (before results assembly) supposed to remove these nodes/edges from the KG / is that code working?
I think @ariutta has previously been part of this conversation as well....and the assumption going into results-assembly code was that the records provided to it were all part of complete paths. I'm now not sure if that assumption is accurate.

ariutta commented 2 years ago

My understanding is that the records provided to QueryResults.update are not necessarily all part of complete paths. They should all match up at specific nodes, but some of the records can still be dead ends. The QueryResults.update code stitches together the complete paths, but I don't know what happens to the KG for the dead ends.

colleenXu commented 2 years ago

It might help to take the queries with this issue and...

look at the edge-execution order and the entities used to query for each QEdge. When are these extra KGEdges being made (aka is the pruning on that level working)?

tokebe commented 2 years ago

Just adding some documentation to use demotests for testing solutions to this issue prior to getting a module-level test written:

After making some code change and running a query to see what the response is
POST that response to https://dev.api.bte.ncats.io/demotests/inspect
demotests will return a response summarizing number of results, etc., and importantly for this case, the results of the sanity check. As @colleenXu pointed out, response -> resultsSanityCheck -> KGEdgesMissingFromResults and KGNodesMissingFromResults is the most relevant information for testing this issue.

colleenXu commented 2 years ago

Next step is likely to update "knowledge graph" after results-assembly, since results-assembly does a good job of pruning "incomplete paths" or nodes/edges that aren't bound to qNodes/qEdges.

From Slack (Marco):

I took a quick look at what happens after query execution and looks like query_graph and results components of the TRAPI response object get updated independently after query execution, so perhaps that's why the sanity check on the dev instance sometimes fails when the graph has entities the results don't. I had a feeling it could be something like that. queryResults seems to prune out results with incomplete paths while the query graph code is much simpler and just adds whatever you give it, it doesn't check for incomplete paths. Perhaps there's a way to update that based on the result of queryResults? This makes sense for multi hop queries why this could be happening but there was another simple query that also showed extra stuff in the query_graph not in the results. But the code from the bteGraph.update (which gets you query_graph) seems to be pretty straight forward. Just putting it out there for discussion

biothings / biothings_explorer

KG keeping nodes and edges that aren't in results #409