RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

KEGG disease nodes are categorized as Pathways #210

Open amykglen opened 2 years ago

amykglen commented 2 years ago

Andy Crouse reported an issue where very un-pathway-like things are being returned when querying for biolink:Pathway nodes associated with a disease. See this query: https://arax.ncats.io/?r=41416

It looks like the synonym clusters returned all have biolink:Pathway as a category, meaning Expand seems to be functioning correctly. So perhaps some nodes in KG2pre have a category of Pathway when they really shouldn't?

Here's an example - rheumatoid arthritis (MONDO:0008383):

These are the names of the nodes in this synonym cluster in KG2c:

"all_names": [
    "rheumatoid arthritis",
    "Proliferative arthritis",
    "Rheumatoid arthritis related phenotypic feature",
    "Arthritis, Rheumatoid",
    "Rheumatoid arthritis",
    "Rheumatoid arthritis, susceptibility to",
    "Rheumatoid Arthritis"
  ],

And these are the equivalent curies in the cluster:

"equivalent_curies": [
    "NCIT:C2884",
    "MESH:D001172",
    "KEGG:05323",
    "LOINC:LA15161-5",
    "MONDO:0008383",
    "UMLS:C1306838",
    "LOINC:LP30644-6",
    "OMIM:180300",
    "PSY:44570",
    "EFO:0000685",
    "DOID:7148",
    "OMIM:MTHU049127",
    "UMLS:C0003873",
    "HP:0001370",
    "ICD9:714.0",
    "UMLS:C1833448"
  ],

And all the different categories in the cluster:

  "all_categories": [
    "biolink:Disease",
    "biolink:PhenotypicFeature",
    "biolink:DiseaseOrPhenotypicFeature",
    "biolink:NamedThing",
    "biolink:Pathway"
  ],

I haven't yet looked in KG2pre to see which specific equivalent curies have a category of biolink:Pathway since the KG2pre Neo4j isn't running, but maybe that'd be a good first step.
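To find other clusters like this in bulk, one option is to flag any synonym cluster whose all_categories mixes biolink:Pathway with a disease-like category. This is only a minimal sketch, assuming clusters are available as dicts with an "all_categories" list shaped like the one above. The helper name and the choice of which categories count as "disease-like" are illustrative assumptions, not part of KG2c or the Node Synonymizer.

```python
# Hypothetical helper; not part of RTX-KG2 or the Node Synonymizer.
# Which categories count as "disease-like" is an assumption made for this sketch.
DISEASE_LIKE = {
    "biolink:Disease",
    "biolink:PhenotypicFeature",
    "biolink:DiseaseOrPhenotypicFeature",
}


def mixes_pathway_and_disease(cluster):
    """Return True if the cluster has both Pathway and a disease-like category."""
    categories = set(cluster.get("all_categories", []))
    return "biolink:Pathway" in categories and bool(categories & DISEASE_LIKE)


# The rheumatoid arthritis cluster shown above.
ra_cluster = {
    "all_categories": [
        "biolink:Disease",
        "biolink:PhenotypicFeature",
        "biolink:DiseaseOrPhenotypicFeature",
        "biolink:NamedThing",
        "biolink:Pathway",
    ]
}

print(mixes_pathway_and_disease(ra_cluster))  # True
```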

edeutsch commented 2 years ago

This is a good example. KEGG has this entry: https://www.genome.jp/dbget-bin/www_bget?pathway:map05323

The Node Synonymizer result at https://arax.ncats.io/?term=KEGG:05323 suggests, I think, that KG2pre has a node KEGG:05323 that is considered a biolink:Pathway. (The "Nodes" block there is KG2pre nodes, I think?)

Perhaps it is the case that not all KEGG CURIEs are Pathways, but the KG2pre import mechanism assumes that all KEGG CURIEs are pathways and assigns them as biolink:Pathway?

I hypothesize that if KEGG:05323 were labeled as a biolink:Disease in KG2pre, then this would not have happened.

amykglen commented 2 years ago

yup, looks like the KEGG node is the problem as @edeutsch explained above in the case of rheumatoid arthritis - just ran this query on KG2.7.5pre neo4j:

match (n) where n.id in [
    "NCIT:C2884",
    "MESH:D001172",
    "KEGG:05323",
    "LOINC:LA15161-5",
    "MONDO:0008383",
    "UMLS:C1306838",
    "LOINC:LP30644-6",
    "OMIM:180300",
    "PSY:44570",
    "EFO:0000685",
    "DOID:7148",
    "OMIM:MTHU049127",
    "UMLS:C0003873",
    "HP:0001370",
    "ICD9:714.0",
    "UMLS:C1833448"
  ] return n.id, n.name, n.category, n.knowledge_source order by n.id
n.id n.name n.category n.knowledge_source
"DOID:7148" "rheumatoid arthritis" "biolink:Disease" "infores:disease-ontology"
"EFO:0000685" "rheumatoid arthritis" "biolink:Disease" "infores:efo"
"HP:0001370" "Rheumatoid arthritis" "biolink:PhenotypicFeature" "infores:hpo"
"ICD9:714.0" "Rheumatoid arthritis" "biolink:Disease" "infores:icd9cm-umls"
"KEGG:05323" "Rheumatoid arthritis" "biolink:Pathway" "infores:kegg"
"LOINC:LA15161-5" "Rheumatoid Arthritis" "biolink:Disease" "infores:loinc-umls"
"LOINC:LP30644-6" "Rheumatoid arthritis" "biolink:Disease" "infores:loinc-umls"
"MESH:D001172" "Arthritis, Rheumatoid" "biolink:Disease" "infores:mesh"
"MONDO:0008383" "rheumatoid arthritis" "biolink:Disease" "infores:mondo"
"NCIT:C2884" "Rheumatoid Arthritis" "biolink:Disease" "infores:ncit"
"OMIM:180300" "Rheumatoid arthritis related phenotypic feature" "biolink:PhenotypicFeature" "infores:omim"
"OMIM:MTHU049127" "Rheumatoid arthritis" "biolink:NamedThing" "infores:omim"
"PSY:44570" "Rheumatoid Arthritis" "biolink:Disease" "infores:psy-umls"
"UMLS:C0003873" "Rheumatoid Arthritis" "biolink:Disease" "infores:umls"
"UMLS:C1306838" "Proliferative arthritis" "biolink:Disease" "infores:umls-metathesaurus"
"UMLS:C1833448" "Rheumatoid arthritis, susceptibility to" "biolink:DiseaseOrPhenotypicFeature" "infores:umls"

(the KEGG node appears to be the only Pathway node)
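The same check can be scripted when a cluster is too large to scan by eye. The sketch below is illustrative only: it assumes the query result has been loaded into (id, category, knowledge_source) tuples, copies a handful of rows from the output above, and uses a hypothetical helper name.

```python
# A subset of rows copied from the KG2.7.5pre query result above:
# (id, category, knowledge_source).
rows = [
    ("DOID:7148", "biolink:Disease", "infores:disease-ontology"),
    ("HP:0001370", "biolink:PhenotypicFeature", "infores:hpo"),
    ("KEGG:05323", "biolink:Pathway", "infores:kegg"),
    ("MONDO:0008383", "biolink:Disease", "infores:mondo"),
    ("OMIM:MTHU049127", "biolink:NamedThing", "infores:omim"),
]


def nodes_with_category(rows, category):
    """Return (id, knowledge_source) pairs for rows carrying the given category."""
    return [(node_id, source) for node_id, cat, source in rows if cat == category]


pathway_nodes = nodes_with_category(rows, "biolink:Pathway")
print(pathway_nodes)  # [('KEGG:05323', 'infores:kegg')]
```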

amykglen commented 2 years ago

though I just looked into a second example (MONDO:0005550 - 'infectious disease'), and in this case it seems that a REACT node is the problem:

match (n) where n.id in [
    "ICD9:079.0",
    "EFO:0005741",
    "LOINC:MTHU040564",
    "DOID:0050117",
    "OBI:1110040",
    "UMLS:C0001485",
    "UMLS:C0009450",
    "LOINC:LA22091-5",
    "LOINC:LP128526-3",
    "LOINC:LP32901-8",
    "NCIT:C26726",
    "MESH:D003141",
    "REACT:R-HSA-5663205",
    "PSY:10513",
    "MONDO:0005550"
  ] return n.id, n.name, n.category, n.knowledge_source order by n.id
n.id n.name n.category n.knowledge_source
"DOID:0050117" "disease by infectious agent" "biolink:Disease" "infores:disease-ontology"
"EFO:0005741" "infectious disease" "biolink:Disease" "infores:efo"
"ICD9:079.0" "Adenovirus infection in conditions classified elsewhere and of unspecified site" "biolink:Disease" "infores:icd9cm-umls"
"LOINC:LA22091-5" "Infectious disease" "biolink:Disease" "infores:loinc-umls"
"LOINC:LP128526-3" "Infectious disease" "biolink:Disease" "infores:loinc-umls"
"LOINC:LP32901-8" "Infectious disease" "biolink:Disease" "infores:loinc-umls"
"LOINC:MTHU040564" "Infectious disease" "biolink:Disease" "infores:loinc-umls"
"MESH:D003141" "Communicable Diseases" "biolink:Disease" "infores:mesh"
"MONDO:0005550" "infectious disease" "biolink:Disease" "infores:mondo"
"NCIT:C26726" "Infectious Disorder" "biolink:Disease" "infores:ncit"
"OBI:1110040" "infectious disease" "biolink:NamedThing" "infores:genepio"
"PSY:10513" "Communicable Diseases" "biolink:Disease" "infores:psy-umls"
"REACT:R-HSA-5663205" "Infectious disease" "biolink:Pathway" "infores:reactome"
"UMLS:C0001485" "Adenovirus infection in conditions classified elsewhere and of unspecified site" "biolink:Disease" "infores:umls"
"UMLS:C0009450" "Communicable Diseases" "biolink:Disease" "infores:umls"

amykglen commented 2 years ago

but a little more surveying seems to suggest KEGG is the bigger problem - some other examples of KEGG nodes incorrectly labeled as Pathways in KG2pre are:

amykglen commented 1 year ago

this is still an issue in RTX-KG2pre; for instance, this node in KG2.8.2pre named 'Melanoma' has a category of Pathway:

{
  "id": "KEGG:05218",
  "name": "Melanoma",
  "full_name": "Melanoma",
  "category": "biolink:Pathway",
  "category_label": "pathway",
  "iri": "https://www.genome.jp/dbget-bin/www_bget?pathway:maphsa05218",
  "deprecated": "False",
  "provided_by": "['infores:kegg']",
  "update_date": "2023-02-17 17:39:33",
  "publications": [
    "PMID:16822996",
    "PMID:12894244",
    "PMID:16750612",
    "PMID:15841168",
    "PMID:16001050",
    "PMID:16001072",
    "PMID:16899407",
    "PMID:15009714",
    "PMID:11224709",
    "PMID:15721476",
    "PMID:14695152",
    "PMID:15557758",
    "PMID:10843728"
  ]
}

not sure if the solution is renaming the node to something like "Melanoma pathway" or changing the category to Disease? probably depends on what KEGG intended to capture in this identifier.

ecwood commented 1 year ago

This node also has a bad IRI, which we want to fix. The actual node is here. Within KEGG, it is also called "Melanoma". We may want to survey the other pathway nodes from KEGG to see if this is a systemic issue (because then it would be easier to make a fix).

ecwood commented 1 year ago

Here's a list of all of the KEGG Pathway names. It does seem like a lot of them would benefit from "pathway" being appended at the end.

[
    "Biotin metabolism",
    "Malaria",
    "Carbon metabolism",
    "Fanconi anemia pathway",
    "Proximal tubule bicarbonate reclamation",
    "Thermogenesis",
    "NOD-like receptor signaling pathway",
    "Peroxisome",
    "TNF signaling pathway",
    "Histidine metabolism",
    "Alzheimer disease",
    "Arrhythmogenic right ventricular cardiomyopathy",
    "Adrenergic signaling in cardiomyocytes",
    "Neutrophil extracellular trap formation",
    "Melanoma",
    "ABC transporters",
    "SNARE interactions in vesicular transport",
    "Choline metabolism in cancer",
    "Estrogen signaling pathway",
    "Proteoglycans in cancer",
    "Homologous recombination",
    "Sulfur relay system",
    "Notch signaling pathway",
    "Longevity regulating pathway",
    "RIG-I-like receptor signaling pathway",
    "Pathogenic Escherichia coli infection",
    "Oxytocin signaling pathway",
    "Fatty acid elongation",
    "Salivary secretion",
    "HIF-1 signaling pathway",
    "Small cell lung cancer",
    "Aldosterone-regulated sodium reabsorption",
    "Neomycin, kanamycin and gentamicin biosynthesis",
    "Endocytosis",
    "Salmonella infection",
    "Ubiquinone and other terpenoid-quinone biosynthesis",
    "Chemical carcinogenesis - DNA adducts",
    "Graft-versus-host disease",
    "Necroptosis",
    "Vitamin B6 metabolism",
    "T cell receptor signaling pathway",
    "Aminoacyl-tRNA biosynthesis",
    "Virion - Adenovirus",
    "Apoptosis - multiple species",
    "Legionellosis",
    "Viral myocarditis",
    "Inflammatory bowel disease",
    "Cholesterol metabolism",
    "Citrate cycle (TCA cycle)",
    "Viral protein interaction with cytokine and cytokine receptor",
    "Glutamatergic synapse",
    "Ubiquitin mediated proteolysis",
    "Pentose and glucuronate interconversions",
    "Gap junction",
    "Cortisol synthesis and secretion",
    "Carbohydrate digestion and absorption",
    "Human T-cell leukemia virus 1 infection",
    "Yersinia infection",
    "cAMP signaling pathway",
    "Glycosphingolipid biosynthesis - globo and isoglobo series",
    "Amphetamine addiction",
    "Retinol metabolism",
    "Growth hormone synthesis, secretion and action",
    "Long-term potentiation",
    "Pancreatic secretion",
    "Galactose metabolism",
    "Acute myeloid leukemia",
    "Autoimmune thyroid disease",
    "Dopaminergic synapse",
    "Mitophagy - animal",
    "Thyroid hormone synthesis",
    "Taste transduction",
    "Biosynthesis of unsaturated fatty acids",
    "Sphingolipid metabolism",
    "Toll-like receptor signaling pathway",
    "Linoleic acid metabolism",
    "Phenylalanine, tyrosine and tryptophan biosynthesis",
    "Long-term depression",
    "Viral carcinogenesis",
    "Influenza A",
    "Endocrine resistance",
    "Thiamine metabolism",
    "African trypanosomiasis",
    "Lipoic acid metabolism",
    "Fc gamma R-mediated phagocytosis",
    "Cysteine and methionine metabolism",
    "Steroid hormone biosynthesis",
    "EGFR tyrosine kinase inhibitor resistance",
    "Hepatitis B",
    "Diabetic cardiomyopathy",
    "Platinum drug resistance",
    "Glycolysis / Gluconeogenesis",
    "Alanine, aspartate and glutamate metabolism",
    "GnRH signaling pathway",
    "Lipid and atherosclerosis",
    "RNA degradation",
    "Phototransduction",
    "Base excision repair",
    "Mucin type O-glycan biosynthesis",
    "AGE-RAGE signaling pathway in diabetic complications",
    "Sphingolipid signaling pathway",
    "Adipocytokine signaling pathway",
    "Caffeine metabolism",
    "Phosphatidylinositol signaling system",
    "D-Amino acid metabolism",
    "Oxidative phosphorylation",
    "Glycine, serine and threonine metabolism",
    "Bladder cancer",
    "Cell adhesion molecules",
    "Morphine addiction",
    "Tryptophan metabolism",
    "C-type lectin receptor signaling pathway",
    "Vasopressin-regulated water reabsorption",
    "Phenylalanine metabolism",
    "Nicotinate and nicotinamide metabolism",
    "Mismatch repair",
    "Neuroactive ligand-receptor interaction",
    "VEGF signaling pathway",
    "Inositol phosphate metabolism",
    "Glutathione metabolism",
    "Circadian rhythm",
    "Cell cycle",
    "Sulfur metabolism",
    "p53 signaling pathway",
    "Fatty acid biosynthesis",
    "Alcoholic liver disease",
    "Hippo signaling pathway - multiple species",
    "Amino sugar and nucleotide sugar metabolism",
    "Fluid shear stress and atherosclerosis",
    "Prolactin signaling pathway",
    "Protein processing in endoplasmic reticulum",
    "Spliceosome",
    "Human papillomavirus infection",
    "Intestinal immune network for IgA production",
    "Pathways in cancer",
    "Terpenoid backbone biosynthesis",
    "Glycerolipid metabolism",
    "Fructose and mannose metabolism",
    "Serotonergic synapse",
    "PI3K-Akt signaling pathway",
    "RNA polymerase",
    "Vascular smooth muscle contraction",
    "Breast cancer",
    "DNA replication",
    "Metabolic pathways",
    "Biosynthesis of cofactors",
    "Rap1 signaling pathway",
    "Transcriptional misregulation in cancer",
    "MAPK signaling pathway",
    "Glucagon signaling pathway",
    "Circadian entrainment",
    "Maturity onset diabetes of the young",
    "Virion - Human immunodeficiency virus",
    "Folate biosynthesis",
    "Kaposi sarcoma-associated herpesvirus infection",
    "Nitrogen metabolism",
    "Type I diabetes mellitus",
    "Glyoxylate and dicarboxylate metabolism",
    "Th1 and Th2 cell differentiation",
    "Epstein-Barr virus infection",
    "ECM-receptor interaction",
    "Huntington disease",
    "Arginine biosynthesis",
    "Protein digestion and absorption",
    "Biosynthesis of amino acids",
    "Glycosylphosphatidylinositol (GPI)-anchor biosynthesis",
    "Phosphonate and phosphinate metabolism",
    "Staphylococcus aureus infection",
    "Allograft rejection",
    "Phospholipase D signaling pathway",
    "Virion - Flavivirus",
    "Adherens junction",
    "Ribosome",
    "Fc epsilon RI signaling pathway",
    "Fatty acid degradation",
    "beta-Alanine metabolism",
    "Retrograde endocannabinoid signaling",
    "Focal adhesion",
    "Drug metabolism - other enzymes",
    "Tight junction",
    "Gastric acid secretion",
    "mRNA surveillance pathway",
    "Osteoclast differentiation",
    "cGMP-PKG signaling pathway",
    "Glycosaminoglycan biosynthesis - keratan sulfate",
    "Regulation of lipolysis in adipocytes",
    "Olfactory transduction",
    "Selenocompound metabolism",
    "Protein export",
    "Vitamin digestion and absorption",
    "Asthma",
    "Chemokine signaling pathway",
    "Calcium signaling pathway",
    "JAK-STAT signaling pathway",
    "Collecting duct acid secretion",
    "Oocyte meiosis",
    "Lysine degradation",
    "Arachidonic acid metabolism",
    "Amoebiasis",
    "Ribosome biogenesis in eukaryotes",
    "Other glycan degradation",
    "Central carbon metabolism in cancer",
    "Renin-angiotensin system",
    "Nucleocytoplasmic transport",
    "Bacterial invasion of epithelial cells",
    "Bile secretion",
    "Chagas disease",
    "Rheumatoid arthritis",
    "Renin secretion",
    "TGF-beta signaling pathway",
    "Leukocyte transendothelial migration",
    "AMPK signaling pathway",
    "Vibrio cholerae infection",
    "Endocrine and other factor-regulated calcium reabsorption",
    "Basal transcription factors",
    "Mannose type O-glycan biosynthesis",
    "Th17 cell differentiation",
    "Insulin resistance",
    "Pyrimidine metabolism",
    "mTOR signaling pathway",
    "Glycosaminoglycan biosynthesis - heparan sulfate / heparin",
    "Nicotine addiction",
    "Measles",
    "Apoptosis",
    "Primary immunodeficiency",
    "Natural killer cell mediated cytotoxicity",
    "Fat digestion and absorption",
    "Cellular senescence",
    "Endometrial cancer",
    "Valine, leucine and isoleucine degradation",
    "Pantothenate and CoA biosynthesis",
    "Ovarian steroidogenesis",
    "Pentose phosphate pathway",
    "Ascorbate and aldarate metabolism",
    "Shigellosis",
    "Prion disease",
    "Fatty acid metabolism",
    "MicroRNAs in cancer",
    "2-Oxocarboxylic acid metabolism",
    "Steroid biosynthesis",
    "Toxoplasmosis",
    "Systemic lupus erythematosus",
    "Motor proteins",
    "Platelet activation",
    "Antigen processing and presentation",
    "Valine, leucine and isoleucine biosynthesis",
    "Glycosaminoglycan degradation",
    "Starch and sucrose metabolism",
    "Regulation of actin cytoskeleton",
    "Butanoate metabolism",
    "Cocaine addiction",
    "Tyrosine metabolism",
    "alpha-Linolenic acid metabolism",
    "Glycosphingolipid biosynthesis - lacto and neolacto series",
    "B cell receptor signaling pathway",
    "Hippo signaling pathway",
    "Non-alcoholic fatty liver disease",
    "PPAR signaling pathway",
    "Neurotrophin signaling pathway",
    "Primary bile acid biosynthesis",
    "Various types of N-glycan biosynthesis",
    "Metabolism of xenobiotics by cytochrome P450",
    "Virion - Lyssavirus",
    "Leishmaniasis",
    "Autophagy - other",
    "NF-kappa B signaling pathway",
    "Ras signaling pathway",
    "Purine metabolism",
    "Cholinergic synapse",
    "Glycerophospholipid metabolism",
    "FoxO signaling pathway",
    "GABAergic synapse",
    "Human immunodeficiency virus 1 infection",
    "Phagosome",
    "Herpes simplex virus 1 infection",
    "One carbon pool by folate",
    "Complement and coagulation cascades",
    "Hepatitis C",
    "Non-small cell lung cancer",
    "Axon guidance",
    "Nucleotide metabolism",
    "Virion - Herpesvirus",
    "Taurine and hypotaurine metabolism",
    "Wnt signaling pathway",
    "Dilated cardiomyopathy",
    "Pertussis",
    "Coronavirus disease - COVID-19",
    "Ferroptosis",
    "Cytokine-cytokine receptor interaction",
    "Type II diabetes mellitus",
    "Cardiac muscle contraction",
    "Porphyrin metabolism",
    "Insulin secretion",
    "Hypertrophic cardiomyopathy",
    "Cushing syndrome",
    "Autophagy - animal",
    "Viral life cycle - HIV-1",
    "Propanoate metabolism",
    "Human cytomegalovirus infection",
    "IL-17 signaling pathway",
    "PD-L1 expression and PD-1 checkpoint pathway in cancer",
    "Hedgehog signaling pathway",
    "Synaptic vesicle cycle",
    "Insulin signaling pathway",
    "Relaxin signaling pathway",
    "Nucleotide excision repair",
    "Glycosaminoglycan biosynthesis - chondroitin sulfate / dermatan sulfate",
    "Glioma",
    "Gastric cancer",
    "N-Glycan biosynthesis",
    "Pyruvate metabolism",
    "Longevity regulating pathway - multiple species",
    "ErbB signaling pathway",
    "Parathyroid hormone synthesis, secretion and action",
    "Thyroid cancer",
    "Renal cell carcinoma",
    "Hepatocellular carcinoma",
    "Riboflavin metabolism",
    "Proteasome",
    "Pancreatic cancer",
    "Ether lipid metabolism",
    "Drug metabolism - cytochrome P450",
    "Progesterone-mediated oocyte maturation",
    "Hematopoietic cell lineage",
    "Non-homologous end-joining",
    "GnRH secretion",
    "Pathways of neurodegeneration - multiple diseases",
    "Basal cell carcinoma",
    "Colorectal cancer",
    "Tuberculosis",
    "Aldosterone synthesis and secretion",
    "Spinocerebellar ataxia",
    "Alcoholism",
    "Inflammatory mediator regulation of TRP channels",
    "Chemical carcinogenesis - reactive oxygen species",
    "Melanogenesis",
    "Arginine and proline metabolism",
    "Parkinson disease",
    "Apelin signaling pathway",
    "Glycosphingolipid biosynthesis - ganglio series",
    "Amyotrophic lateral sclerosis",
    "Epithelial cell signaling in Helicobacter pylori infection",
    "Thyroid hormone signaling pathway",
    "Antifolate resistance",
    "Cytosolic DNA-sensing pathway",
    "Signaling pathways regulating pluripotency of stem cells",
    "Chronic myeloid leukemia",
    "Other types of O-glycan biosynthesis",
    "Biosynthesis of nucleotide sugars",
    "Chemical carcinogenesis - receptor activation",
    "Lysosome",
    "Mineral absorption",
    "Prostate cancer"
]
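If renaming turns out to be the preferred fix, the suffixing could be done mechanically. The following is a rough sketch rather than the KG2 build code: the function name is hypothetical, and the rule (skip any name that already mentions "pathway", case-insensitively) is an assumption chosen so that names like "TNF signaling pathway" and "Pathways in cancer" are left alone. Whether renaming or recategorizing is the right fix is still an open question in this thread.

```python
def with_pathway_suffix(name):
    """Append ' pathway' unless the name already mentions a pathway."""
    if "pathway" in name.lower():
        return name
    return f"{name} pathway"


# A few names taken from the list above.
sample = [
    "Melanoma",
    "Rheumatoid arthritis",
    "TNF signaling pathway",
    "Pathways in cancer",
]

renamed = [with_pathway_suffix(name) for name in sample]
for old, new in zip(sample, renamed):
    print(f"{old!r} -> {new!r}")
```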