Open andrewsu opened 1 year ago
It's technically feasible, but we may have to make a multi-stage parser which computes neither `cumulative_pmid_count` nor `cumulative_predications_count` in the first stage, and fills in `cumulative_pmid_count` and `cumulative_predications_count` afterward, because we have to wait until all children's counts are ready before we can compute a parent's cumulative counts.
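The ordering constraint could be handled roughly as in the sketch below. It is only an illustration, not the parser's actual code: the concept IDs `A` to `D`, the `narrower` mapping, and the function name `descendant_closure` are all hypothetical, and it assumes the hierarchy has no cycles. It uses the standard-library `graphlib` module (Python 3.9+) to visit children before their parents, so each parent can reuse its children's results.

```python
from graphlib import TopologicalSorter

# Hypothetical toy hierarchy: concept -> its direct narrower concepts.
narrower = {"A": ["B", "C"], "B": ["D"]}


def descendant_closure(narrower):
    """Map every concept to itself plus all of its transitive narrower concepts."""
    closure = {}
    # TopologicalSorter treats the mapped values as predecessors, so
    # static_order() yields narrower concepts before the broader ones.
    # It raises graphlib.CycleError if the hierarchy contains a cycle.
    for concept in TopologicalSorter(narrower).static_order():
        members = {concept}
        for child in narrower.get(concept, ()):
            members |= closure[child]  # already computed: children come first
        closure[concept] = members
    return closure


closure = descendant_closure(narrower)
print(sorted(closure["A"]))  # ['A', 'B', 'C', 'D']
```

A second stage could then look up each triple's subject and object in `closure` when filling in the cumulative fields.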
I am also thinking about creating an upstream pipeline (or project, in its own repo) that does all the dirty work (data cleaning, cumulative counting, etc.) and makes our parser a lightweight bridge between the files and MongoDB storage.
I would appreciate @newgene's advice on the design as well.
Also consider putting the upstream pipeline into the post-dump stage.
For this issue itself, the calculation of `cumulative_pmid_count` can be done at the post-upload stage (with MongoDB operations).
In https://github.com/biothings/BioThings_Explorer_TRAPI/issues/569, we created the semmeddb2 API, which aggregates records by unique subject-predicate-object triples and computes `pmid_count` and `predication_count` values: the number of unique PMIDs and predications that support each triple. So, for example, https://biothings.ncats.io/semmeddb2/query?q=subject.umls:C0935989%20AND%20object.umls:C0023418%20AND%20predicate:TREATS currently shows that there are 69 unique PMIDs and 74 unique predications for the triple `imatinib (C0935989)`-`treats`-`leukemia (C0023418)`.

However, if we are planning on setting a threshold based roughly on "number of supporting pubmed articles", then Matt Brush pointed out that we should also utilize the UMLS hierarchy in computing that metric. So for the `imatinib (C0935989)`-`treats`-`leukemia (C0023418)` example, we should check all "narrower concepts" of imatinib, including `Gleevec (C0935987)`, which has additional PMIDs and predications in support of its role in treating leukemia: https://biothings.ncats.io/semmeddb2/query?q=subject.umls:C0935987%20AND%20object.umls:C0023418%20AND%20predicate:TREATS
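For checking other subject/object combinations by hand, the two query URLs above differ only in the subject CUI, so they can be generated with a small helper. This is a convenience sketch: the function name `build_query_url` is made up for illustration, and the query syntax is simply copied from the URLs quoted in this issue.

```python
from urllib.parse import quote

# Endpoint taken from the query URLs quoted in this issue.
BASE_URL = "https://biothings.ncats.io/semmeddb2/query"


def build_query_url(subject_cui, object_cui, predicate):
    """Build a semmeddb2 query URL for one subject-predicate-object triple."""
    query = (
        f"subject.umls:{subject_cui} AND "
        f"object.umls:{object_cui} AND "
        f"predicate:{predicate}"
    )
    # Keep ":" unescaped so the result matches the URLs used in this issue;
    # spaces are percent-encoded as %20.
    return f"{BASE_URL}?q={quote(query, safe=':')}"


# imatinib and its narrower concept Gleevec, both against leukemia.
for cui in ("C0935989", "C0935987"):
    print(build_query_url(cui, "C0023418", "TREATS"))
```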
Similarly, in checking all narrower concepts of leukemia, we will likely find other records with additional PMIDs and predications related to imatinib. So the proposal here would be to create new fields called `cumulative_pmid_count` and `cumulative_predications_count` that account for all combinations of "subject+descendants" and "object+descendants".

Understanding that this may not be trivial from a conceptual and computational standpoint, I'd like to start by getting some technical assessment on feasibility from @erikyao...
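To make the proposed definition concrete, here is a minimal sketch of one possible reading of `cumulative_pmid_count`. It is not an implementation against the real data: the PMIDs are made-up integers, `LEUKEMIA_SUBTYPE` is a placeholder rather than a real CUI, and the `narrower` and `pmids_by_triple` structures are assumptions. PMIDs are combined as sets rather than by summing per-triple counts, on the assumption that one article may support several related triples.

```python
# Assumed toy hierarchy: concept -> its direct narrower concepts.
narrower = {
    "C0935989": ["C0935987"],          # imatinib -> Gleevec (from this issue)
    "C0023418": ["LEUKEMIA_SUBTYPE"],  # leukemia -> placeholder, not a real CUI
}

# Made-up PMIDs per (subject, predicate, object) triple.
pmids_by_triple = {
    ("C0935989", "TREATS", "C0023418"): {101, 102, 103},
    ("C0935987", "TREATS", "C0023418"): {103, 104},
    ("C0935989", "TREATS", "LEUKEMIA_SUBTYPE"): {105},
}


def descendants_or_self(cui, narrower):
    """Return the concept itself plus all transitive narrower concepts."""
    seen = {cui}
    stack = [cui]
    while stack:
        current = stack.pop()
        for child in narrower.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def cumulative_pmid_count(subject, predicate, obj, pmids_by_triple, narrower):
    """Count unique PMIDs over all subject+descendants x object+descendants."""
    pmids = set()
    for s in descendants_or_self(subject, narrower):
        for o in descendants_or_self(obj, narrower):
            pmids |= pmids_by_triple.get((s, predicate, o), set())
    return len(pmids)


count = cumulative_pmid_count(
    "C0935989", "TREATS", "C0023418", pmids_by_triple, narrower
)
print(count)  # 5 unique PMIDs; summing the per-triple counts would give 6
```

`cumulative_predications_count` could follow the same pattern with sets of predication IDs in place of PMIDs.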