Open AlexanderPico opened 1 month ago
Alt proposal: We keep the current enrichment algo as-is, applied to all nodes from traverseResultForNodes, and then add a secondary filter to only keep hits that contain both the input entity (subject) and answer (object).
Advantages:
Disadvantages:
Given the times we're seeing with the present queries to BioThings PFOCR (>15 seconds per batch), it makes more sense to me to reduce back down to just querying for subject and object, and then doing a secondary traverseResultsForNodes for better scoring.
As of https://github.com/biothings/bte_trapi_query_graph_handler/pull/211, we are no longer traversing the entire result for nodes, instead just looking at direct result-bound nodes. We do traverse results after to increase the node pool, which is used for relevancy ranking. Additionally, the scoring function has been made less expensive in https://github.com/biothings/bte_trapi_query_graph_handler/pull/207
So, the scope of this issue is changed to two objectives:
We've been assessing the result-level enrichment/augmentation with PFOCR hits. The output today has some consistent issues:
In parallel, we've been assessing the edge retrieval hits on the same queries and have been pleasantly surprised by the quality. This observation suggests some specific refinements we might pursue...
Using an MVP2 query of "Genes increased by Bivalirudin" as an example, currently, we traverseResultForNodes to collect a set of genes, including subject and object (answer) nodes, along with any intermediate nodes (from 2-hop), and then perform an intersection. We require a MATCH_COUNT_MIN = 2 and calculate a score based on precision and recall.
This is too relaxed. We end up with a ton of hits having 2 matching CURIEs, sometimes not including either of the original subject or object nodes. Each of these on its own is a fairly weak and useless finding for researchers.
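To make the problem concrete, here is a minimal sketch of the matching step described above. It is not the actual BTE code: the function name, the placeholder CURIEs, and the exact precision/recall definitions (matches over figure size, matches over result size, combined F1-style) are assumptions for illustration. Only MATCH_COUNT_MIN = 2 comes from the description above.

```python
from typing import Optional, Set

MATCH_COUNT_MIN = 2  # threshold named in this issue


def score_figure(result_curies: Set[str], figure_curies: Set[str]) -> Optional[float]:
    """Hypothetical helper: F1-style score, or None when too few CURIEs match."""
    matched = result_curies & figure_curies
    if len(matched) < MATCH_COUNT_MIN:
        return None
    # Assumed definitions, not taken from the BTE source:
    precision = len(matched) / len(figure_curies)  # share of the figure that matched
    recall = len(matched) / len(result_curies)     # share of the result that matched
    return 2 * precision * recall / (precision + recall)


# Placeholder CURIEs: subject, answer, and two intermediate (2-hop) nodes.
result_curies = {"CHEM:subject", "GENE:answer", "GENE:mid1", "GENE:mid2"}
# This figure contains neither the subject nor the answer, yet it still passes.
figure_curies = {"GENE:mid1", "GENE:mid2", "GENE:other1", "GENE:other2"}
score = score_figure(result_curies, figure_curies)
```

Under these assumed definitions the figure gets a score of 0.5 even though it matches only the two intermediate nodes, which is the kind of weak hit described above.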
Proposals:
- [...] resultCuries are restricted to just those two nodes. This would dramatically reduce the number of PFOCR hits, while dramatically increasing their relevance.
- [...] traverseResultForNodes in order to get a total resultGenesInFigure to be used in the calculation of precision, i.e., for scoring. Both the traversal and the secondary intersection would only be performed on the subset of figures that pass the first intersection, so this might be faster as well.
- [...] resultGenesInOtherFigures is time consuming. If so, we could skip it and rely solely on precision to calculate a score.
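The proposals could be sketched roughly as a two-stage filter. This is only an illustration under assumptions: the function name, argument shapes, and placeholder CURIEs are hypothetical, and the score is precision alone (the fallback mentioned in the last proposal), with precision assumed to be the share of a figure's CURIEs found among the result nodes.

```python
from typing import Dict, List, Set, Tuple


def rank_figures(
    subject: str,
    answer: str,
    traversed_curies: Set[str],
    figures: Dict[str, Set[str]],
) -> List[Tuple[str, float]]:
    """Hypothetical two-stage ranking: strict filter first, then precision."""
    required = {subject, answer}
    ranked = []
    for figure_id, figure_curies in figures.items():
        # Stage 1: keep only figures that contain both subject and answer.
        if not required <= figure_curies:
            continue
        # Stage 2: only for survivors, widen to every traversed node
        # to count the result genes present in this figure.
        result_genes_in_figure = (traversed_curies | required) & figure_curies
        precision = len(result_genes_in_figure) / len(figure_curies)
        ranked.append((figure_id, precision))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


figures = {
    "fig1": {"CHEM:subject", "GENE:answer", "GENE:mid1"},
    "fig2": {"GENE:answer", "GENE:mid1", "GENE:other1"},  # no subject: dropped
    "fig3": {"CHEM:subject", "GENE:answer", "GENE:o1", "GENE:o2", "GENE:o3"},
}
ranked = rank_figures("CHEM:subject", "GENE:answer", {"GENE:mid1"}, figures)
```

Because stage 2 runs only on figures that pass stage 1, the wider intersection touches far fewer figures, which is where the possible speedup would come from.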