Open andrewsu opened 1 year ago
The current semmeddb parser is complicated, and it is not straightforward to do so many jobs (parsing, finding children, updating `pmid_count`, etc.) in a single parser run.
Therefore it would be better to convert the current plugin from a manifest-based plugin to an advanced one, which enables hooks (like "after dump" and "after upload") so the jobs can be better orchestrated.
We can once again use [umls-parsed.json](https://github.com/biothings/node-expansion/blob/main/data/umls-parsed.json) from the `node_expansion` project for the CUI hierarchy. However, we need a clearer definition of sub-predications.
Suppose we have disease `x` and phenotype `y`, each of which has children like:

     _ x _
    /     \
   x1     ...

     _ y _
    /     \
   y1     ...

Do we consider `x has_phenotype y1`, or `x1 has_phenotype y`? Or do we only take `x1 has_phenotype y1` into account? @andrewsu
> Do we consider `x has_phenotype y1`, or `x1 has_phenotype y`? Or do we only take `x1 has_phenotype y1` into account?

From my perspective, I think all three of the triples above would be included when counting `x has_phenotype y`. But I'm open to being convinced if you or others have a different opinion here.
Also, I think it's clear, but we are still counting unique PMIDs after combining. So if the PMIDs cited for each triple look like this:

- `x has_phenotype y`: [a, b]
- `x has_phenotype y1`: [a, c]
- `x1 has_phenotype y`: [b, c]
- `x1 has_phenotype y1`: [c]

then I think `pmid_count` = 3. Does this behavior sound right?
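
To make the counting rule concrete, here is a minimal Python sketch of the example above. The dictionary layout and the function name `combined_pmid_count` are illustrative assumptions, not part of the existing parser; the only point is that PMIDs are unioned as a set before counting.

```python
# PMIDs cited for each triple, copied from the example above.
# The (subject, predicate, object) tuple keys are an illustrative layout only.
pmids_by_triple = {
    ("x", "has_phenotype", "y"): {"a", "b"},
    ("x", "has_phenotype", "y1"): {"a", "c"},
    ("x1", "has_phenotype", "y"): {"b", "c"},
    ("x1", "has_phenotype", "y1"): {"c"},
}


def combined_pmid_count(pmids_by_triple):
    """Count unique PMIDs across all triples that roll up into one parent triple."""
    # Union the sets first so a PMID cited by several triples is counted once.
    combined = set().union(*pmids_by_triple.values())
    return len(combined)


print(combined_pmid_count(pmids_by_triple))  # 3, i.e. the unique PMIDs a, b, c
```

Summing the per-triple counts instead would give 7, which is why the union has to happen before counting.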
Hi @andrewsu, thanks for your input.

> I think all three of the triples above would be included

Reasonable.

> Does this behavior sound right?

Absolutely.

On further review, I think this issue may be a duplicate of #607. @erikyao, if you agree, please close one of them...

In https://github.com/biothings/biothings_explorer/issues/569, we refactored the semmeddb document structure (currently available in https://biothings.ncats.io/semmeddb2). In the original semmeddb API, each document was a predication from semmeddb. In the revised API, each document was a unique subject-predicate-object triple. The revised structure included a `pmid_count` and a `predication_count`. In https://github.com/NCATSTranslator/Feedback/issues/100, we are considering adding a filter for Translator based on `pmid_count`.

This issue is to propose using the UMLS hierarchy in the creation of records. So, for example, suppose we have `disease_x - has_phenotype - phenotype_y` that was mentioned in one semmeddb predication. But suppose that `disease_x` has a subclass `disease_x1` and `phenotype_y` has a subclass `phenotype_y1`, and there is also one semmeddb predication that says `disease_x1 - has_phenotype - phenotype_y1`. Currently (in the semmeddb2 API), both the records `disease_x - has_phenotype - phenotype_y` and `disease_x1 - has_phenotype - phenotype_y1` would have a `pmid_count` of 1. I would propose instead that `disease_x - has_phenotype - phenotype_y` have a `pmid_count` of 2 because of the subclass relationships.

Recognizing that this would be a substantial increase in complexity in the parser and could substantially increase execution time, I'm going to immediately put this issue on hold. But I'm creating the issue just to track it.
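
As a rough illustration of the proposal, the following Python sketch rolls PMIDs up the hierarchy for the `disease_x` / `phenotype_y` example. Everything here is an assumption made for illustration: the `children` map stands in for the real UMLS hierarchy, the PMID values `pmid_1` and `pmid_2` are placeholders (the two predications are assumed to come from different articles), and the function names are hypothetical rather than taken from the semmeddb parser.

```python
from itertools import product

# Hypothetical parent -> direct children map; the real hierarchy would come from UMLS.
children = {
    "disease_x": ["disease_x1"],
    "phenotype_y": ["phenotype_y1"],
}

# Placeholder PMIDs for the two predications in the example (assumed distinct).
direct_pmids = {
    ("disease_x", "has_phenotype", "phenotype_y"): {"pmid_1"},
    ("disease_x1", "has_phenotype", "phenotype_y1"): {"pmid_2"},
}


def self_and_descendants(term, children):
    """Return the term plus every term reachable through the children map."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guards against repeated or cyclic entries
        seen.add(node)
        stack.extend(children.get(node, []))
    return seen


def rolled_up_pmid_count(subject, predicate, obj, direct_pmids, children):
    """Count unique PMIDs over the triple and all of its sub-predications."""
    pmids = set()
    subjects = self_and_descendants(subject, children)
    objects = self_and_descendants(obj, children)
    for s, o in product(subjects, objects):
        pmids |= direct_pmids.get((s, predicate, o), set())
    return len(pmids)


print(rolled_up_pmid_count("disease_x", "has_phenotype", "phenotype_y",
                           direct_pmids, children))  # 2
```

Iterating over every subject/object descendant pair is the simplest way to state the rule, but it is also where the execution-time concern comes from: terms with many descendants produce a large number of candidate pairs, so a real implementation would likely need to index observed triples rather than enumerate all combinations.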