RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Another curious BTE result #1369

Closed dkoslicki closed 1 year ago

dkoslicki commented 3 years ago

For this PK (6d853ff1-ea50-4cd1-8609-9089ae26db76), a few of the results are miscategorized. E.g.:

Name: NESIRITIDE
Id: CHEMBL.COMPOUND:CHEMBL1201668
Category: biolink:Gene
equivalent_identifiers (biolink:id): 
   NCBIGENE:4879
   name:natriuretic peptide B
   SYMBOL:NPPB
   UMLS:C1417808
   UMLS:C0054015
   HGNC:7940
   UniProtKB:P16860
   ENSEMBL:ENSG00000120937
   OMIM:600295

It’s categorized as a gene, but is actually a chemical substance. Looking at the result, it appears this node came from BTE, so perhaps it’s a synonymizer problem when results are merged.

amykglen commented 3 years ago

hmm, yeah, it looks like the synonymizer says 'Nesiritide' and 'Natriuretic Peptide B' are the same thing:

[Screenshot, 2021-04-13 9:49 AM]

(so I think BTE answered with 'Natriuretic Peptide B' but then when merging results, the synonymizer mistakenly changed it to 'Nesiritide'.)

edeutsch commented 3 years ago

I'm guessing that is another manifestation of the problems with the SRI Node Normalizer discussed here: https://github.com/TranslatorSRI/NodeNormalization/issues/56

Although I haven't gotten notice yet, I'm also guessing that those problems are probably fixed. So I will begin another build and see if that fixes this problem.

edeutsch commented 3 years ago

Well, after Googling around a little, it seems that lots of places say they are the same thing: https://en.wikipedia.org/wiki/Nesiritide Or maybe more specifically, nesiritide is a name for a "chemical substance" that is the recombinant protein. So is it a protein? Yes! Is it a therapeutic? Yes! Is it a drug? Yes? Is it a chemical substance? In the way that we use it, I suppose yes. Is it a gene? Err, genes and proteins are not the same thing, but there's a gene that encodes this protein.

I think this falls into the same category as insulin and probably many others, where it's all those things and we need to have some special handling for human proteins that are also considered therapeutics.

dkoslicki commented 3 years ago

Interesting find! Though there are other results besides Nesiritide that appear to be miscategorized, such as “Geleophysic dysplasia 2” (a disease) being conflated with fibrillin (a gene). Atrial fibrillation is another such example, unfortunately.

ecwood commented 3 years ago

Amy reported that this bug is still present in KG2.6.7. I did some investigating and it looks like the root of the weird synonymization is a DrugBank edge and the odd naming of a UniProt node.

I started by getting the equivalent curies from KG2.6.7c:

match (n) where 'CHEMBL.COMPOUND:CHEMBL1201668' in n.equivalent_curies return n
{
  "iri": "https://identifiers.org/chembl.compound:CHEMBL1201668",
  "expanded_categories": [
    "biolink:ChemicalSubstance",
    "biolink:Drug",
    "biolink:MolecularEntity",
    "biolink:BiologicalEntity",
    "biolink:NamedThing"
  ],
  "name": "NESIRITIDE",
  "description": "-!- FUNCTION: [Brain natriuretic peptide 32]: Cardiac hormone that plays a key role in mediating cardio-renal homeostasis (PubMed:9458824, PubMed:1672777, PubMed:1914098, PubMed:17372040). May also function as a paracrine antifibrotic factor in the heart (By similarity). Acts by specifically binding and stimulating NPR1 to produce cGMP, which in turn activates effector proteins that drive various biological responses (PubMed:9458824, PubMed:1672777, PubMed:17372040, PubMed:21098034, PubMed:17349887, PubMed:25339504). Involved in regulating the extracellular fluid volume and maintaining the fluid-electrolyte balance through natriuresis, diuresis, vasorelaxation, and inhibition of renin and aldosterone secretion (PubMed:9458824, PubMed:1914098). Binds the clearance receptor NPR3 (PubMed:16870210). {ECO:0000250|UniProtKB:P40753, ECO:0000269|PubMed:1672777, ECO:0000269|PubMed:16870210, ECO:0000269|PubMed:17349887, ECO:0000269|PubMed:17372040, ECO:0000269|PubMed:1914098, ECO:0000269|PubMed:21098034, ECO:0000269|PubMed:25339504, ECO:0000269|PubMed:9458824}. -!- FUNCTION: [NT-proBNP]: May affect cardio-renal homeostasis (PubMed:17372040). Able to promote the production of cGMP although its potency is very low compared to brain natriuretic peptide 32 (PubMed:17372040). {ECO:0000269|PubMed:17372040}. -!- FUNCTION: [BNP(3-32)]: May have a role in cardio-renal homeostasis (PubMed:17372040). Able to promote the production of cGMP (PubMed:17372040). {ECO:0000269|PubMed:17372040}. 
-!- INTERACTION: P16860; A8MQ03: CYSRT1; NbExp=3; IntAct=EBI-747044, EBI-3867333; P16860; P57678: GEMIN4; NbExp=3; IntAct=EBI-747044, EBI-356700; P16860; Q6A162: KRT40; NbExp=3; IntAct=EBI-747044, EBI-10171697; P16860; P60411: KRTAP10-9; NbExp=3; IntAct=EBI-747044, EBI-10172052; P16860; Q7Z3S9: NOTCH2NLA; NbExp=4; IntAct=EBI-747044, EBI-945833; P16860; P25788: PSMA3; NbExp=3; IntAct=EBI-747044, EBI-348380; P16860; Q9UJW9: SERTAD3; NbExp=3; IntAct=EBI-747044, EBI-748621; -!- SUBCELLULAR LOCATION: [NT-proBNP]: Secreted {ECO:0000269|PubMed:18466803, ECO:0000269|PubMed:25339504}. Note=Detected in blood. {ECO:0000269|PubMed:18466803, ECO:0000269|PubMed:25339504}. -!- SUBCELLULAR LOCATION: [proBNP(3-108)]: Secreted {ECO:0000269|PubMed:17367664}. Note=Detected in blood. {ECO:0000269|PubMed:17367664}. -!- SUBCELLULAR LOCATION: [Brain natriuretic peptide 32]: Secreted {ECO:0000269|PubMed:17367664, ECO:0000269|PubMed:18466803, ECO:0000269|PubMed:1914098, ECO:0000269|PubMed:25339504}. Note=Detected in blood. {ECO:0000269|PubMed:17367664, ECO:0000269|PubMed:18466803, ECO:0000269|PubMed:1914098, ECO:0000269|PubMed:25339504}. -!- SUBCELLULAR LOCATION: [BNP(3-32)]: Secreted {ECO:0000269|PubMed:17367664}. Note=Detected in blood. {ECO:0000269|PubMed:17367664}. -!- TISSUE SPECIFICITY: [Brain natriuretic peptide 32]: Detected in the cardiac atria (at protein level) (PubMed:2138890, PubMed:2136732). Detected in the kidney distal tubular cells (at protein level) (PubMed:9794555). {ECO:0000269|PubMed:2136732, ECO:0000269|PubMed:2138890, ECO:0000269|PubMed:9794555}. -!- PTM: The precursor molecule is proteolytically cleaved by the endoproteases FURIN or CORIN at Arg-102 to produce brain natriuretic peptide 32 and NT-proBNP (PubMed:21314817, PubMed:10880574, PubMed:21763278, PubMed:20489134, PubMed:21482747). This likely occurs after it has been secreted into the blood, either during circulation or in the target cells (PubMed:21482747). 
CORIN also cleaves the precursor molecule at additional residues including Arg-99 and possibly Lys-105 (PubMed:20489134, PubMed:21763278). In patients with heart failure, processing and degradation of natriuretic peptides B occurs but is delayed, possibly due to a decrease in enzyme level or activity of CORIN and DPP4 (PubMed:25339504). {ECO:0000269|PubMed:10880574, ECO:0000269|PubMed:20489134, ECO:0000269|PubMed:21314817, ECO:0000269|PubMed:21482747, ECO:0000269|PubMed:21763278, ECO:0000269|PubMed:25339504}. -!- PTM: [Brain natriuretic peptide 32]: Undergoes further proteolytic cleavage by various proteases such as DPP4, MME and possibly FAP, to give rise to a variety of shorter peptides (PubMed:16254193, PubMed:19808300, PubMed:21314817, PubMed:21098034). Cleaved at Pro-104 by the prolyl endopeptidase FAP (seprase) activity (in vitro) (PubMed:21314817). Degraded by IDE (PubMed:21098034). During IDE degradation, the resulting products initially increase the activation of NPR1 and can also stimulate NPR2 to produce cGMP before the fragments are completely degraded and inactivated by IDE (in vitro) (PubMed:21098034). {ECO:0000269|PubMed:16254193, ECO:0000269|PubMed:19808300, ECO:0000269|PubMed:21098034, ECO:0000269|PubMed:21314817}. -!- PTM: O-glycosylated on at least seven residues (PubMed:20489134, PubMed:21763278, PubMed:16750161, PubMed:17349887, PubMed:21482747). In cardiomyocytes, glycosylation at Thr-97 is essential for the stability and processing of the extracellular natriuretic peptides B (PubMed:21482747). Glycosylation, especially at Thr-97, may also be important for brain natriuretic peptide 32 stability and/or extracellular distribution (PubMed:21763278). Glycosylation at Thr-97 appears to inhibit FURIN- or CORIN-mediated proteolytic processing, at least in HEK293 cells (PubMed:20489134, PubMed:21763278). {ECO:0000269|PubMed:16750161, ECO:0000269|PubMed:17349887, ECO:0000269|PubMed:20489134, ECO:0000269|PubMed:21482747, ECO:0000269|PubMed:21763278}. 
-!- PHARMACEUTICAL: Available under the name Nesiritide (Scios). Used for the treatment of heart failure. -!- MISCELLANEOUS: Plasma levels of natriuretic peptides B, brain natriuretic peptide 32 and NT-proBNP are widely used for screening and diagnosis of heart failure (HF), as these markers are typically higher in patients with severe HF. {ECO:0000269|PubMed:17349887, ECO:0000269|PubMed:17372040, ECO:0000269|PubMed:18466803, ECO:0000269|PubMed:21482747, ECO:0000269|PubMed:25339504}. -!- SIMILARITY: Belongs to the natriuretic peptide family. {ECO:0000305}. -!- SEQUENCE CAUTION: Sequence=BAA90441.1; Type=Frameshift; Evidence={ECO:0000305}; -!- WEB RESOURCE: Name=Wikipedia; Note=Brain natriuretic peptide entry; URL="https://en.wikipedia.org/wiki/Brain_natriuretic_peptide"; ; Short=preproBNP {ECO:0000250|UniProtKB:P40753}; Short=proBNP {ECO:0000303|PubMed:2138890}; Short=BNP(1-32) {ECO:0000305}; Short=BNP-32 {ECO:0000305}; Short=BNP {ECO:0000303|PubMed:2597152}; Short=BNP(4-32) {ECO:0000303|PubMed:20489134}Evidence Codes from Name:  ",
  "equivalent_curies": [
    "RXNORM:19666",
    "OMIM:600295",
    "DRUGBANK:DB04899",
    "CHEMBL.COMPOUND:CHEMBL1201668",
    "UMLS:C0054015",
    "PR:000011375",
    "HGNC:7940",
    "PDQ:CDR0000434520",
    "ENSEMBL:ENSG00000120937",
    "NCIT:C47636",
    "PR:P16860",
    "NCIT:C88522",
    "NCBIGene:4879",
    "NDDF:009300",
    "UMLS:C1317554",
    "ATC:C01DX19",
    "UniProtKB:P16860",
    "VANDF:4024230",
    "DrugCentral:1901"
  ],
  "id": "CHEMBL.COMPOUND:CHEMBL1201668",
  "category": "biolink:ChemicalSubstance",
  "all_names": [
    "natriuretic peptides B (human)",
    "Nesiritide",
    "NPPB",
    "Natriuretic Peptides B",
    "NPPB (human)",
    "nesiritide",
    "natriuretic peptides B",
    "NESIRITIDE",
    "Genetic locus associated with NPPB"
  ],
  "all_categories": [
    "biolink:MolecularEntity",
    "biolink:Gene",
    "biolink:Protein",
    "biolink:Drug",
    "biolink:ChemicalSubstance"
  ],
  "publications": [
    "PMID:2136732",
    "DOI:10.1016/s0006-291x(89)80015-4",
    "PMID:9794555",
    "PMID:2597152",
    "PMID:12070532",
    "PMID:17372040",
    "PMID:20489134",
    "DOI:10.1016/j.jacc.2006.10.063",
    "PMID:17367664",
    "DOI:10.1016/0014-5793(90)80043-i"
  ]
}
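The node above lists biolink:Gene, biolink:Protein, biolink:Drug, and biolink:ChemicalSubstance together in all_categories, which is the symptom reported in this issue. A minimal sketch of how such nodes could be flagged follows. The grouping of categories and the function names are assumptions made for illustration, not part of KG2 or the NodeSynonymizer.

```python
# Hypothetical grouping of Biolink categories. These groups are an assumption
# for illustration only; they are not defined by KG2 or the NodeSynonymizer.
CATEGORY_GROUPS = {
    "gene_or_protein": {"biolink:Gene", "biolink:Protein"},
    "chemical": {"biolink:Drug", "biolink:ChemicalSubstance"},
    "disease": {"biolink:Disease"},
}


def conflated_groups(node):
    """Return the sorted names of the category groups found in all_categories."""
    categories = set(node.get("all_categories", []))
    return sorted(
        name for name, members in CATEGORY_GROUPS.items() if categories & members
    )


def looks_conflated(node):
    """A node that spans more than one group is a candidate for a bad merge."""
    return len(conflated_groups(node)) > 1


# The all_categories list copied from the KG2.6.7c node shown above.
nesiritide = {
    "id": "CHEMBL.COMPOUND:CHEMBL1201668",
    "all_categories": [
        "biolink:MolecularEntity",
        "biolink:Gene",
        "biolink:Protein",
        "biolink:Drug",
        "biolink:ChemicalSubstance",
    ],
}

print(conflated_groups(nesiritide))  # ['chemical', 'gene_or_protein']
print(looks_conflated(nesiritide))   # True
```

A check like this only surfaces candidates; as discussed earlier in the thread, a recombinant protein used as a drug may legitimately carry several of these categories.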

Then, I ran this query in KG2.6.7's Neo4j endpoint:

match (n) where n.id in ["RXNORM:19666","OMIM:600295","DRUGBANK:DB04899","CHEMBL.COMPOUND:CHEMBL1201668","UMLS:C0054015","PR:000011375","HGNC:7940","PDQ:CDR0000434520","ENSEMBL:ENSG00000120937","NCIT:C47636","PR:P16860","NCIT:C88522","NCBIGene:4879","NDDF:009300","UMLS:C1317554","ATC:C01DX19","UniProtKB:P16860","VANDF:4024230","DrugCentral:1901"] return n
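Rather than pasting the CURIE list into the query by hand, the same lookup can be generated from a node's equivalent_curies. This is a rough sketch, assuming the node is available as a Python dict; the function name build_lookup_query is hypothetical.

```python
def build_lookup_query(node):
    """Build a parameterized Cypher query matching every equivalent CURIE of a node."""
    # dict.fromkeys drops duplicate CURIEs while keeping their original order.
    curies = list(dict.fromkeys(node.get("equivalent_curies", [])))
    # $curies is a Cypher query parameter, so no string escaping is needed.
    query = "MATCH (n) WHERE n.id IN $curies RETURN n"
    return query, {"curies": curies}


# A shortened example using three of the CURIEs from the node above.
example_node = {
    "equivalent_curies": [
        "DRUGBANK:DB04899",
        "CHEMBL.COMPOUND:CHEMBL1201668",
        "UniProtKB:P16860",
    ]
}
query, params = build_lookup_query(example_node)
print(query)
print(params)
```

The returned query string and parameter dict can then be passed to whatever Neo4j client is in use, provided it supports parameterized Cypher.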

Here's the strange edge:

[Image]

This is similar to the issue here: https://github.com/RTXteam/RTX-KG2/issues/75#issuecomment-864257431

amykglen commented 3 years ago

for KG2.6.7.1c, these conflations appear resolved, although interestingly, the synonymizer seems to say that NESIRITIDE doesn't exist: https://arax.ncats.io/?term=NESIRITIDE

NPPB appears good though: https://arax.ncats.io/?term=NCBIGENE:4879

dkoslicki commented 3 years ago

Odd, since the back end appears to say it exists: https://arax.ncats.io/?r=16944

edeutsch commented 3 years ago

I think we should investigate this a bit more: https://arax.ncats.io/?term=NESIRITIDE

NodeSynonymizer bug?

dkoslicki commented 3 years ago

Especially curious since it shows up in the auto-complete

amykglen commented 3 years ago

for the record, pasting in my explanation from slack of why NESIRITIDE can exist in KG2c and not in the synonymizer:

when the synonymizer doesn't recognize a node, KG2c just falls back to including that node from KG2, without merging it with any others. I have that ruling in place because the synonymizer can't always recognize nodes without names, of which there are quite a few in KG2.
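The fallback rule described above can be sketched roughly as follows. This is not the actual KG2c build code: the function name, the dict-based lookup, and the example mapping are all assumptions made for illustration.

```python
from collections import defaultdict


def cluster_nodes(node_ids, canonical_ids):
    """Group KG2 node IDs by canonical ID, keeping unrecognized nodes unmerged.

    canonical_ids is a hypothetical stand-in for the synonymizer: a dict that
    maps each recognized node ID to its canonical ID.
    """
    clusters = defaultdict(list)
    for node_id in node_ids:
        # An unrecognized node falls back to its own ID, so it is carried
        # over from KG2 as-is and forms a cluster of its own.
        canonical = canonical_ids.get(node_id, node_id)
        clusters[canonical].append(node_id)
    return dict(clusters)


# Made-up mapping for illustration; not actual synonymizer output.
canonical_ids = {
    "NCBIGene:4879": "NCBIGene:4879",
    "HGNC:7940": "NCBIGene:4879",
}
node_ids = ["NCBIGene:4879", "HGNC:7940", "CHEMBL.COMPOUND:CHEMBL1201668"]

print(cluster_nodes(node_ids, canonical_ids))
```

Under this rule, a node can be present in KG2c even though a synonymizer lookup for it returns nothing, which matches the behavior seen for NESIRITIDE.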

amykglen commented 1 year ago

this conflation appears resolved on dev/ci instances - we align with the SRI now, who has separate clusters for the drug-like nesiritide vs. protein/gene-like nesiritide:

https://arax.ncats.io/devLM/?term=NESIRITIDE
https://arax.ncats.io/devLM/?term=NCBIGene:4879

closing