biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

semmeddb: use UMLS hierarchy in refactored document structure #638

Open andrewsu opened 1 year ago

andrewsu commented 1 year ago

In https://github.com/biothings/biothings_explorer/issues/569, we refactored the semmeddb document structure (currently available at https://biothings.ncats.io/semmeddb2). In the original semmeddb API, each document was a predication from semmeddb. In the revised API, each document is a unique subject-predicate-object triple. The revised structure includes a pmid_count and a predication_count. In https://github.com/NCATSTranslator/Feedback/issues/100, we are considering adding a filter for Translator based on pmid_count.

This issue proposes using the UMLS hierarchy in the creation of records. So, for example, suppose we have disease_x - has_phenotype - phenotype_y that was mentioned in one semmeddb predication. But suppose that disease_x has a subclass disease_x1 and phenotype_y has a subclass phenotype_y1, and there is also one semmeddb predication that says disease_x1 - has_phenotype - phenotype_y1. Currently (in the semmeddb2 API), both the records disease_x - has_phenotype - phenotype_y and disease_x1 - has_phenotype - phenotype_y1 would have a pmid_count of 1. I would propose instead that disease_x - has_phenotype - phenotype_y have a pmid_count of 2 because of the subclass relationships.

Recognizing that this would substantially increase the complexity of the parser and could substantially increase execution time, I'm going to immediately put this issue on hold. But I'm creating the issue just to track it.

erikyao commented 1 year ago

Thoughts on design

Refactor for readability

The current semmeddb parser is complicated, and it's not straightforward to do so many jobs (parsing, finding children, updating pmid_count, etc.) in a single parser run.

Therefore, it's better to convert the current plugin from a manifest-based plugin to an advanced one, which enables some hooks (like "after dump", "after upload") so the jobs can be better orchestrated.

Finding the sub-predications

We can once again use [umls-parsed.json](https://github.com/biothings/node-expansion/blob/main/data/umls-parsed.json) from the node_expansion project for the CUI hierarchy. However, we need a clearer definition of sub-predications.

Suppose we have disease x and phenotype y, each of which has children like:

   _ x _
  /     \
x1       ...

   _ y _
  /     \
y1       ...

Do we consider x has_phenotype y1, or x1 has_phenotype y? Or do we only take x1 has_phenotype y1 into account? @andrewsu
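To make the three cases concrete, here is a minimal sketch that enumerates the candidate sub-predications of x has_phenotype y. The `CHILDREN` mapping (CUI to direct children) and the function names are assumptions for illustration only; the actual layout of umls-parsed.json may differ and would need to be adapted.

```python
from itertools import product

# Hypothetical child map: CUI -> list of direct children.
# This shape is assumed; it is not taken from umls-parsed.json.
CHILDREN = {"x": ["x1"], "y": ["y1"]}


def descendants(cui, children):
    """Return the term itself plus all of its descendants (iterative DFS)."""
    seen = {cui}
    stack = [cui]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:  # guards against cycles and repeats
                seen.add(child)
                stack.append(child)
    return seen


def sub_predications(subject, predicate, obj, children):
    """Triples whose subject and object are the given terms or their
    descendants, excluding the original triple itself."""
    subjects = descendants(subject, children)
    objects = descendants(obj, children)
    return sorted(
        (s, predicate, o)
        for s, o in product(subjects, objects)
        if (s, o) != (subject, obj)
    )


candidates = sub_predications("x", "has_phenotype", "y", CHILDREN)
for triple in candidates:
    print(triple)
```

With this toy hierarchy, the candidates are exactly the three triples asked about: x has_phenotype y1, x1 has_phenotype y, and x1 has_phenotype y1.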

andrewsu commented 1 year ago

> Do we consider x has_phenotype y1, or x1 has_phenotype y? Or do we only take x1 has_phenotype y1 into account?

From my perspective, I think all three of the triples above would be included when counting x has_phenotype y. But I'm open to being convinced if you or others have a different opinion here.

Also, I think it's clear, but we are still counting unique PMIDs after combining. So if the PMIDs cited for each triple look like this:

x has_phenotype y: [a, b]
x has_phenotype y1: [a, c]
x1 has_phenotype y: [b, c]
x1 has_phenotype y1: [c]

then I think pmid_count = 3. Does this behavior sound right?
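As a quick check of the arithmetic, here is a minimal sketch of the unique-PMID count for the example above. The `PMIDS` dictionary and the `rolled_up_pmid_count` helper are hypothetical names used only for illustration; they are not part of the semmeddb parser.

```python
# PMIDs cited for each triple, copied from the example above.
PMIDS = {
    ("x", "has_phenotype", "y"): {"a", "b"},
    ("x", "has_phenotype", "y1"): {"a", "c"},
    ("x1", "has_phenotype", "y"): {"b", "c"},
    ("x1", "has_phenotype", "y1"): {"c"},
}


def rolled_up_pmid_count(triples, pmids_by_triple):
    """Count unique PMIDs across a triple and its sub-predications."""
    unique = set()
    for triple in triples:
        # Set union deduplicates PMIDs shared between triples.
        unique |= pmids_by_triple.get(triple, set())
    return len(unique)


count = rolled_up_pmid_count(PMIDS.keys(), PMIDS)
naive_sum = sum(len(pmids) for pmids in PMIDS.values())
print(count, naive_sum)
```

The union of the four lists is {a, b, c}, so the rolled-up count is 3, whereas simply adding the per-triple counts would give 7.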

erikyao commented 1 year ago

Hi @andrewsu, thanks for your input.

> I think all three of the triples above would be included

Reasonable.

> Does this behavior sound right?

Absolutely.

andrewsu commented 1 year ago

On further review, I think this issue may be a duplicate of #607. @erikyao, if you agree, please close one of them...