biothings / semmeddb


`conflate` parameter of Node Normalizer #7

Closed erikyao closed 1 year ago

erikyao commented 1 year ago

Original Purpose: gene-gene equivalence detection

Previously we decided to leverage Node Normalizer to find equivalent NCBI Gene IDs for CUIs that represent genes. A predication whose subject or object is an NCBI Gene ID equivalent to such a CUI is considered redundant with the CUI's predication and thus should be deleted by the parser script.

E.g. after the piped CUI C1418660|5361 is separated into 2 rows, there are two predications:

| row-id | PREDICATION_ID | PMID | PREDICATE | SUBJECT_CUI | SUBJECT_NAME | SUBJECT_SEMTYPE | OBJECT_CUI | OBJECT_NAME | OBJECT_SEMTYPE |
|---|---|---|---|---|---|---|---|---|---|
| 69865 | 14008146 | 16541019 | INHIBITS | C1418660 | PLXNA1 gene | gngm | C1418661 | PLXNA2 gene | gngm |
| 69866 | 14008146 | 16541019 | INHIBITS | 5361 | PLXNA1 | gngm | C1418661 | PLXNA2 gene | gngm |

5361 is equivalent to C1418660 so the second predication can be deleted.
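The split-then-delete step described above could be sketched roughly as follows. This is only an illustration, not the actual parser script: the function names and the `equivalents` mapping (NCBI Gene ID to the set of CUIs it is equivalent to, assumed to have been collected from Node Normalizer beforehand) are hypothetical, and the sketch copies the row unchanged apart from the split field, whereas the real rows also differ in SUBJECT_NAME.

```python
from typing import Dict, List, Set

Row = Dict[str, str]


def split_piped_row(row: Row, field: str = "SUBJECT_CUI") -> List[Row]:
    """Expand a row whose `field` holds pipe-separated IDs into one row per ID."""
    return [{**row, field: part} for part in row[field].split("|")]


def drop_redundant(rows: List[Row],
                   equivalents: Dict[str, Set[str]],
                   field: str = "SUBJECT_CUI") -> List[Row]:
    """Drop rows whose `field` is a gene ID equivalent to a CUI in the same predication.

    `equivalents` is a hypothetical lookup: NCBI Gene ID -> set of equivalent CUIs.
    """
    # Collect the CUIs (values starting with "C") seen for each predication.
    cuis_by_predication: Dict[str, Set[str]] = {}
    for row in rows:
        if row[field].startswith("C"):
            cuis_by_predication.setdefault(row["PREDICATION_ID"], set()).add(row[field])

    kept = []
    for row in rows:
        value = row[field]
        sibling_cuis = cuis_by_predication.get(row["PREDICATION_ID"], set())
        is_gene_id = not value.startswith("C")
        if is_gene_id and equivalents.get(value, set()) & sibling_cuis:
            continue  # redundant: an equivalent CUI row already exists
        kept.append(row)
    return kept
```

With the example above, splitting `C1418660|5361` and passing `{"5361": {"C1418660"}}` as `equivalents` would keep only the C1418660 row. Gene IDs with no known equivalent CUI are kept as they are.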

Going Further: protein-gene equivalence detection when conflate is true

Passing {"conflate": true} to the Node Normalizer means "asking the endpoint to return conflated data" (currently only Gene-Protein conflation). See Babel output formats >> Conflation.

We do have such protein-gene data in the SemMedDB predications, e.g.:

| row-id | PREDICATION_ID | PMID | PREDICATE | SUBJECT_CUI | SUBJECT_NAME | SUBJECT_SEMTYPE | OBJECT_CUI | OBJECT_NAME | OBJECT_SEMTYPE |
|---|---|---|---|---|---|---|---|---|---|
| 64933 | 10603013 | 16530496 | ASSOCIATED_WITH | C0020063 | PTH protein, human | aapp | C0029463 | osteosarcoma | neop |
| 64934 | 10603013 | 16530496 | ASSOCIATED_WITH | 5741 | PTH | aapp | C0029463 | osteosarcoma | neop |

Node Normalizer with {"conflate": true} is able to report the equivalence between C0020063 and 5741.
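As a rough sketch of how such a check might look in code: the helpers below build the request body and compare two CURIEs in a parsed response. The helper names are hypothetical. The response shape (each queried CURIE mapped to an object with an `id.identifier` field, or to null when unknown), the `UMLS:` / `NCBIGene:` CURIE prefixes, and the endpoint URL in the comment are assumptions about the Node Normalizer service that should be checked against its documentation.

```python
from typing import Any, Dict, Iterable, Optional


def build_payload(curies: Iterable[str], conflate: bool = True) -> Dict[str, Any]:
    """Build the JSON body for a normalization request (assumed field names)."""
    return {"curies": list(curies), "conflate": conflate}


def are_equivalent(response: Dict[str, Optional[Dict[str, Any]]],
                   curie_a: str, curie_b: str) -> bool:
    """Treat two CURIEs as equivalent if they normalize to the same preferred ID.

    Assumes `response[curie]` is either None (unknown CURIE) or a dict
    containing {"id": {"identifier": ...}}.
    """
    entry_a = response.get(curie_a)
    entry_b = response.get(curie_b)
    if not entry_a or not entry_b:
        return False
    return entry_a["id"]["identifier"] == entry_b["id"]["identifier"]


# Possible usage (not executed here; URL and third-party `requests` are assumptions):
#   import requests
#   payload = build_payload(["UMLS:C0020063", "NCBIGene:5741"], conflate=True)
#   resp = requests.post(
#       "https://nodenormalization-sri.renci.org/get_normalized_nodes", json=payload
#   ).json()
#   redundant = are_equivalent(resp, "UMLS:C0020063", "NCBIGene:5741")
```

Comparing preferred identifiers, rather than scanning each entry's full list of equivalent identifiers, keeps the check short. It relies on the assumption that conflation makes both the protein CUI and the gene ID resolve to the same preferred node.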

QUESTION: Shall we enable conflate to delete such redundant predications (like the second row above)?

Outlaws: peptide-gene equivalence?

E.g.

| row-id | PREDICATION_ID | PMID | PREDICATE | SUBJECT_CUI | SUBJECT_NAME | SUBJECT_SEMTYPE | OBJECT_CUI | OBJECT_NAME | OBJECT_SEMTYPE |
|---|---|---|---|---|---|---|---|---|---|
| 64923 | 10597756 | 16530483 | INTERACTS_WITH | C0027893 | neuropeptide Y | gngm | C0039194 | T-Lymphocyte | cell |
| 64924 | 10597756 | 16530483 | INTERACTS_WITH | 4852 | NPY | gngm | C0039194 | T-Lymphocyte | cell |

Node Normalizer CANNOT report the equivalence between C0027893 and 4852.

QUESTION: Will this cause trouble for BTE?

erikyao commented 1 year ago

Jan 4th meeting, Colleen's input: