in semmeddb2 API, calculate `cumulative_pmid_count` that utilizes the UMLS hierarchy

andrewsu commented 1 year ago

In https://github.com/biothings/BioThings_Explorer_TRAPI/issues/569, we created the semmeddb2 API that aggregates records by unique subject-predicate-object triples, and computes pmid_count and predication_count values with the number of unique PMIDs and predications that support that triple. So, for example, https://biothings.ncats.io/semmeddb2/query?q=subject.umls:C0935989%20AND%20object.umls:C0023418%20AND%20predicate:TREATS currently shows that there are 69 unique PMIDs and 74 unique predications for the triple imatinib (C0935989) - treats - leukemia (C0023418).
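For reference, the query string in that example can be assembled programmatically. The snippet below is only a minimal sketch: the helper name `triple_query_url` is made up for illustration, and it only builds the URL (it does not call the API or assume anything about the response format).

```python
from urllib.parse import quote

# Endpoint taken from the example URL in this issue.
BASE_URL = "https://biothings.ncats.io/semmeddb2/query"


def triple_query_url(subject_cui, predicate, object_cui):
    """Build a semmeddb2 query URL for one subject-predicate-object triple.

    Hypothetical helper, not part of any BioThings SDK.
    """
    q = (
        f"subject.umls:{subject_cui} "
        f"AND object.umls:{object_cui} "
        f"AND predicate:{predicate}"
    )
    # Keep ':' unescaped so the field:value syntax stays readable;
    # spaces become %20, matching the URL quoted above.
    return f"{BASE_URL}?q={quote(q, safe=':')}"


url = triple_query_url("C0935989", "TREATS", "C0023418")
print(url)
```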

However, if we plan to set a threshold based roughly on the "number of supporting PubMed articles", then Matt Brush pointed out that we should also use the UMLS hierarchy when computing that metric. So for the imatinib (C0935989) - treats - leukemia (C0023418) example, we should check all "narrower concepts" of imatinib, including Gleevec (C0935987), which has additional PMIDs and predications supporting its role in treating leukemia:

https://biothings.ncats.io/semmeddb2/query?q=subject.umls:C0935987%20AND%20object.umls:C0023418%20AND%20predicate:TREATS

Similarly, in checking all narrower concepts of leukemia, we will likely find other records with additional PMIDs and predications related to imatinib. So the proposal here is to create new fields called cumulative_pmid_count and cumulative_predications_count that account for all combinations of "subject+descendants" and "object+descendants".
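To make the proposal concrete, here is a rough sketch of how `cumulative_pmid_count` could be computed for one triple. Everything in it is an assumption for illustration: the `NARROWER` map, the PMID values, and the placeholder `CHILD_OF_LEUKEMIA` identifier are made up, and only the imatinib, Gleevec, and leukemia CUIs come from this issue. The one point it is meant to show is that PMID sets are unioned rather than summed, so an article supporting several narrower triples is counted once.

```python
from itertools import product

# Hypothetical "narrower concept" map: CUI -> direct children.
NARROWER = {
    "C0935989": {"C0935987"},           # imatinib -> Gleevec (from this issue)
    "C0023418": {"CHILD_OF_LEUKEMIA"},  # leukemia -> made-up placeholder
}

# Hypothetical PMID sets per (subject, predicate, object) triple.
PMIDS = {
    ("C0935989", "TREATS", "C0023418"): {1, 2, 3},
    ("C0935987", "TREATS", "C0023418"): {3, 4},
    ("C0935989", "TREATS", "CHILD_OF_LEUKEMIA"): {5},
}


def with_descendants(cui, narrower):
    """Return the concept itself plus all of its transitive descendants."""
    seen = set()
    stack = [cui]
    while stack:
        node = stack.pop()
        if node in seen:  # guards against cycles in the hierarchy
            continue
        seen.add(node)
        stack.extend(narrower.get(node, ()))
    return seen


def cumulative_pmid_count(subject, predicate, obj, narrower, pmids):
    """Count unique PMIDs over all subject+descendants x object+descendants."""
    merged = set()
    subjects = with_descendants(subject, narrower)
    objects = with_descendants(obj, narrower)
    for s, o in product(subjects, objects):
        merged |= pmids.get((s, predicate, o), set())
    return len(merged)


total = cumulative_pmid_count("C0935989", "TREATS", "C0023418", NARROWER, PMIDS)
print(total)  # 5: PMID 3 appears under two triples but is counted once
```

The same approach would apply to `cumulative_predications_count`, with sets of predication IDs in place of PMIDs.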

Understanding that this may not be trivial from a conceptual and computational standpoint, I'd like to start by getting some technical assessment on feasibility from @erikyao...

erikyao commented 1 year ago

It's technically feasible, but we may have to write a multi-stage parser that:

1. generates all documents without cumulative_pmid_count or cumulative_predications_count, and
2. calculates cumulative_pmid_count and cumulative_predications_count afterward.

This is because we have to wait until all of a concept's children's counts are ready before we can compute the parent's cumulative counts.
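A minimal sketch of what such a two-stage flow might look like is below. It is not the actual parser: the row layout `(predication_id, subject, predicate, object, pmid)`, the `children` map, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict


def stage_one(rows):
    """Stage 1: aggregate raw rows into one document per unique triple.

    Assumed row layout: (predication_id, subject, predicate, object, pmid).
    """
    docs = defaultdict(lambda: {"pmids": set(), "predications": set()})
    for pred_id, subj, predicate, obj, pmid in rows:
        doc = docs[(subj, predicate, obj)]
        doc["pmids"].add(pmid)
        doc["predications"].add(pred_id)
    for doc in docs.values():
        doc["pmid_count"] = len(doc["pmids"])
        doc["predication_count"] = len(doc["predications"])
    return dict(docs)


def closure(cui, children):
    """Return the concept plus all transitive descendants (cycle-safe)."""
    seen, stack = set(), [cui]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


def stage_two(docs, children):
    """Stage 2: add cumulative counts once every base document exists."""
    for (subj, predicate, obj), doc in docs.items():
        pmids, preds = set(), set()
        for s in closure(subj, children):
            for o in closure(obj, children):
                other = docs.get((s, predicate, o))
                if other is not None:
                    pmids |= other["pmids"]
                    preds |= other["predications"]
        doc["cumulative_pmid_count"] = len(pmids)
        doc["cumulative_predications_count"] = len(preds)
    return docs


# Toy data: "A1" is a hypothetical narrower concept of "A".
rows = [
    (10, "A", "TREATS", "X", 100),
    (11, "A", "TREATS", "X", 101),
    (12, "A1", "TREATS", "X", 101),
    (13, "A1", "TREATS", "X", 102),
]
docs = stage_two(stage_one(rows), children={"A": {"A1"}})
print(docs[("A", "TREATS", "X")]["cumulative_pmid_count"])
```

Stage 2 only reads the sets built in stage 1, so it cannot start until stage 1 has seen every row.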

I am also thinking about creating an upstream pipeline (or project, in its own repo) that does all the dirty work (data cleaning, cumulative counting, etc.) and makes our parser a lightweight bridge between the files and MongoDB storage.


I would appreciate @newgene's advice on the design as well.

erikyao commented 1 year ago

Also consider putting the upstream pipeline into the post-dump stage.

For this issue itself, the calculation of cumulative_pmid_count can be done at the post-upload stage (with MongoDB operations).

biothings / biothings_explorer

in semmeddb2 API, calculate `cumulative_pmid_count` that utilizes the UMLS hierarchy #607