biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

Refinement of PFOCR Result-level Enrichment #847

Open AlexanderPico opened 1 month ago

AlexanderPico commented 1 month ago

We've been assessing the result-level enrichment/augmentation with PFOCR hits. The output today has some consistent issues:

In parallel, we've been assessing the edge retrieval hits on the same queries and have been pleasantly surprised by the quality. This observation suggests some specific refinements we might pursue...

Take the MVP2 query "Genes increased by Bivalirudin" as an example. Currently, we call traverseResultForNodes to collect a set of genes, including the subject and object (answer) nodes along with any intermediate nodes (from 2-hop), and then intersect that set with each figure. We require MATCH_COUNT_MIN = 2 and calculate a score based on precision and recall.
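For readers unfamiliar with the current behavior, here is a minimal sketch of the intersection and threshold described above. Only MATCH_COUNT_MIN comes from the issue; the function name, the argument names, the exact precision and recall definitions, and the use of F1 to combine them are assumptions made for illustration, not the actual BTE implementation. The CURIEs in the usage note are placeholders.

```python
MATCH_COUNT_MIN = 2  # threshold named in the issue


def score_figure(result_genes, figure_genes, result_genes_in_other_figures):
    """Score one PFOCR figure against the genes collected from a result.

    result_genes: set of gene CURIEs gathered from the result (hypothetical input)
    figure_genes: set of gene CURIEs annotated on the figure (hypothetical input)
    result_genes_in_other_figures: count of result genes found only in other
        figures (assumed to be the recall denominator's second term)

    Returns (matched, score), or None when the figure fails the threshold.
    """
    matched = result_genes & figure_genes
    if len(matched) < MATCH_COUNT_MIN:
        return None

    # Assumed definitions: precision is the share of the figure covered by
    # result genes; recall is the share of figure-mapped result genes that
    # land in this figure.
    precision = len(matched) / len(figure_genes)
    recall = len(matched) / (len(matched) + result_genes_in_other_figures)

    # Assumption: combine the two with F1. The issue only says the score is
    # "based on precision and recall".
    score = 2 * precision * recall / (precision + recall)
    return matched, score
```

With result genes {"GENE:A", "GENE:B", "GENE:C"} and a figure containing {"GENE:A", "GENE:B", "GENE:D", "GENE:E"}, two genes match, so the figure passes the threshold even though neither match has to be the subject or object node. That is the looseness the issue describes.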

This is too relaxed. We end up with a ton of hits that have only 2 matching CURIEs, sometimes not including either of the original subject or object nodes. Each of these hits on its own is a fairly weak and useless finding for researchers.

Proposals:

  1. Focus on subject and object nodes. Once we are able to consider chemicals and diseases, then we could perform an intersection where resultCuries are restricted to just those two nodes. This would dramatically reduce the number of PFOCR hits, while dramatically increasing their relevance.
    • We could follow up on these hits with a secondary intersection that includes the other nodes from traverseResultForNodes, in order to get a total resultGenesInFigure to be used in the calculation of precision, i.e., for scoring. Both the traversal and the secondary intersection would only be performed on the subset of figures that pass the first intersection, so this might be faster as well.
  2. I suspect the calculation of resultGenesInOtherFigures is time-consuming. If so, we could skip it and rely solely on precision to calculate a score.
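
The two proposals above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the function name, the dict-of-sets representation of figures, and the reading that a figure must contain both the subject and the object to pass the first intersection are all hypothetical. Scoring uses precision only, as in proposal 2.

```python
def two_stage_pfocr_hits(subject, obj, other_nodes, figures):
    """Filter figures on subject/object first, then score the survivors.

    subject, obj: CURIEs of the result's subject and object nodes
    other_nodes: set of CURIEs for the remaining nodes that a traversal of
        the result would yield (hypothetical input)
    figures: dict mapping a figure ID to the set of CURIEs in that figure
        (hypothetical shape)

    Returns a dict mapping figure ID to a precision-only score.
    """
    anchors = {subject, obj}
    hits = {}
    for figure_id, figure_curies in figures.items():
        # Stage 1: keep only figures that contain both anchor nodes.
        if not anchors <= figure_curies:
            continue
        # Stage 2: widen the node pool, but only for surviving figures.
        result_curies_in_figure = (anchors | other_nodes) & figure_curies
        # Precision only; no count of result genes in other figures needed.
        hits[figure_id] = len(result_curies_in_figure) / len(figure_curies)
    return hits
```

Because the second intersection runs only on figures that survive the first, the more expensive work scales with the number of relevant figures rather than with every figure returned.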
AlexanderPico commented 4 weeks ago

Alt proposal: We keep the current enrichment algo as-is, applied to all nodes from traverseResultForNodes and then add a secondary filter to only keep hits that contain both the input entity (subject) and answer (object).
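
As a rough sketch of this alternative, the secondary filter could be a single pass over the hits that the existing enrichment already produces. The hit shape used here (a dict with a "matched" set of CURIEs) and the function name are hypothetical, chosen only to illustrate the idea.

```python
def keep_hits_with_subject_and_object(hits, subject, obj):
    """Keep only enrichment hits whose matched CURIEs include both nodes.

    hits: list of dicts, each with a "matched" set of CURIEs
        (hypothetical shape; the real hit objects may differ)
    subject, obj: CURIEs of the input entity and the answer node
    """
    return [
        hit for hit in hits
        if subject in hit["matched"] and obj in hit["matched"]
    ]
```

Unlike filtering up front, this leaves the enrichment and scoring untouched and only trims the output, so it would not reduce the work done before the filter runs.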

Advantages:

Disadvantages:

tokebe commented 4 weeks ago

Given the times we're seeing with the present queries to BioThings PFOCR (>15 seconds per batch), it makes more sense to me to scale back to querying for just the subject and object, and then do a secondary traverseResultForNodes for better scoring.

tokebe commented 2 weeks ago

As of https://github.com/biothings/bte_trapi_query_graph_handler/pull/211, we no longer traverse the entire result for nodes; instead, we look only at direct result-bound nodes. We still traverse results afterward to enlarge the node pool, which is used for relevancy ranking. Additionally, the scoring function has been made less expensive in https://github.com/biothings/bte_trapi_query_graph_handler/pull/207.

So, the scope of this issue is changed to two objectives: