RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

Why do some UMLS nodes not have TUIs listed in the `description` field? #57

Open saramsey opened 3 years ago

saramsey commented 3 years ago

Thank you to Will Byrd for reporting this issue.

For many UMLS nodes in KG2, we include the semantic type (TUI) in the description field. But for some, we do not. For example, the Cypher query

match (n {id: 'UMLS:C0018681'}) return n.name, n.description

shows that for "headache", the description field includes the TUI, as expected. But for the Cypher query

match (n {id: 'UMLS:C0394007'}) return n.name, n.description

the result for "Cerebral Palsy" does not include the TUI in the description field. Why is that? (The subtext here is that Team Unsecret Agent in some cases uses the TUI information for KG2 UMLS nodes, so if we can provide it, that would be helpful to them.)
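
To make concrete what a downstream consumer loses when the TUI is absent, here is a minimal Python sketch of pulling a TUI out of a description string. It assumes the TUI is embedded as `UMLS_STY:` followed by a code of the form `T` plus three digits; only the `UMLS_STY` marker itself is confirmed by the queries in this thread. The function name and the sample strings are hypothetical.

```python
import re
from typing import Optional

# Assumption: the TUI appears as "UMLS_STY:" followed by a code such as "T184".
# Only the "UMLS_STY" marker is confirmed by the queries in this thread.
TUI_PATTERN = re.compile(r"UMLS_STY:\s*(T\d{3})")


def extract_tui(description: Optional[str]) -> Optional[str]:
    """Return the first TUI found in a node description, or None."""
    if description is None:
        return None
    match = TUI_PATTERN.search(description)
    return match.group(1) if match else None


# Hypothetical description strings, for illustration only.
with_tui = "Pain in the head. UMLS_STY:T184"
without_tui = "A description with no semantic type marker."
```

A caller that relies on `extract_tui` gets `None` for every node affected by this issue, which is why the gap matters to consumers of the `description` field.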

saramsey commented 3 years ago

See the full text of Will Byrd's email in #56.

ecwood commented 3 years ago

From KG2.6.7:

match (n) where (split(n.provided_by, ':')[0]='umls_source' or n.provided_by="identifiers_org_registry:umls") and not (n.description contains "UMLS_STY") return count(n)

count(n)
186647

match (n) where (split(n.provided_by, ':')[0]='umls_source' or n.provided_by="identifiers_org_registry:umls") and not (n.description contains "UMLS_STY") return n.category, n.provided_by, count(n)

n.category n.provided_by count(n)
"biolink:MolecularEntity" "identifiers_org_registry:umls" 20954
"biolink:ChemicalSubstance" "identifiers_org_registry:umls" 6376
"biolink:Drug" "identifiers_org_registry:umls" 2178
"biolink:InformationContentEntity" "umls_source:ATC" 3
"biolink:NamedThing" "identifiers_org_registry:umls" 1483
"biolink:IndividualOrganism" "identifiers_org_registry:umls" 6883
"biolink:AnatomicalEntity" "identifiers_org_registry:umls" 926
"biolink:InformationContentEntity" "umls_source:DRUGBANK" 2
"biolink:Protein" "identifiers_org_registry:umls" 896
"biolink:GrossAnatomicalStructure" "identifiers_org_registry:umls" 2214
"biolink:CellularComponent" "identifiers_org_registry:umls" 6060
"biolink:InformationContentEntity" "umls_source:FMA" 164
"biolink:Cell" "identifiers_org_registry:umls" 1059
"biolink:InformationContentEntity" "identifiers_org_registry:umls" 24006
"biolink:PhysiologicalProcess" "identifiers_org_registry:umls" 32797
"biolink:Disease" "identifiers_org_registry:umls" 19689
"biolink:GenomicEntity" "identifiers_org_registry:umls" 769
"biolink:DiseaseOrPhenotypicFeature" "identifiers_org_registry:umls" 17469
"biolink:MolecularActivity" "identifiers_org_registry:umls" 26543
"biolink:Phenomenon" "identifiers_org_registry:umls" 2946
"biolink:Activity" "identifiers_org_registry:umls" 1513
"biolink:PathologicalProcess" "identifiers_org_registry:umls" 1365
"biolink:InformationContentEntity" "umls_source:GO" 26
"biolink:Procedure" "identifiers_org_registry:umls" 7553
"biolink:Device" "identifiers_org_registry:umls" 697
"biolink:InformationContentEntity" "umls_source:HCPCS" 19
"biolink:InformationContentEntity" "umls_source:HGNC" 26
"biolink:InformationContentEntity" "umls_source:HL7" 59
"biolink:InformationContentEntity" "umls_source:HPO" 6
"biolink:PopulationOfIndividualOrganisms" "identifiers_org_registry:umls" 244
"biolink:InformationContentEntity" "umls_source:ICD10PCS" 3
"biolink:InformationContentEntity" "umls_source:ICD9CM" 7
"biolink:GeographicLocation" "identifiers_org_registry:umls" 341
"biolink:InformationContentEntity" "umls_source:LNC" 158
"biolink:Agent" "identifiers_org_registry:umls" 626
"biolink:InformationContentEntity" "umls_source:MEDLINEPLUS" 10
"biolink:InformationContentEntity" "umls_source:MED-RT" 6
"biolink:BiologicalEntity" "identifiers_org_registry:umls" 123
"biolink:InformationContentEntity" "umls_source:MSH" 37
"biolink:Carbohydrate" "identifiers_org_registry:umls" 1
"biolink:InformationContentEntity" "umls_source:NCBITAXON" 2
"biolink:InformationContentEntity" "umls_source:NCI" 271
"biolink:InformationContentEntity" "umls_source:NDDF" 5
"biolink:NamedThing" "umls_source:OMIM" 14
"biolink:InformationContentEntity" "umls_source:PDQ" 16
"biolink:InformationContentEntity" "umls_source:PSY" 7
"biolink:InformationContentEntity" "umls_source:RXNORM" 50
"biolink:InformationContentEntity" "umls_source:VANDF" 19
"biolink:InformationContentEntity" "umls_source:MTH" 26

match (n) where (n.provided_by="identifiers_org_registry:umls") return not (n.description contains "UMLS_STY"), split(n.id, ':')[0], count(n)

not (n.description contains "UMLS_STY") split(n.id, ':')[0] count(n)
null "UMLS" 2785105
false "UMLS" 157507
true "UMLS" 185711
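
The `null` row is worth a note. In Cypher, `contains` applied to a null value yields null, `not null` is still null, and a `where` clause keeps a row only when its predicate is true. So nodes with no description at all are most likely what the `null` row counts, and they are silently excluded from the counts produced by the `and not (n.description contains "UMLS_STY")` filter. A small Python sketch of that three-valued logic, with `None` standing in for null and hypothetical sample descriptions:

```python
from typing import Optional


def cypher_not_contains(description: Optional[str], needle: str) -> Optional[bool]:
    """Mimic Cypher's `not (description contains needle)`; None stands in for null."""
    if description is None:
        return None  # null CONTAINS ... is null, and NOT null is still null
    return needle not in description


def where_keeps(predicate: Optional[bool]) -> bool:
    """A WHERE clause keeps a row only when its predicate is exactly true."""
    return predicate is True


# Hypothetical descriptions covering the three result rows above.
samples = {
    "no description": None,
    "has marker": "Some text; UMLS_STY:T047",
    "lacks marker": "Some text without a semantic type",
}
results = {label: cypher_not_contains(d, "UMLS_STY") for label, d in samples.items()}
kept = [label for label, value in results.items() if where_keeps(value)]
```

If nodes without any description should also count as missing the TUI, the filter would need an explicit `n.description is null or ...` branch.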

ecwood commented 3 years ago

This is important for implementing #86.

ecwood commented 1 year ago

This is definitely still an issue, as of KG2.8.3:

match (n) where (n.provided_by in ["['infores:atc-codes-umls']", "['infores:cpt-codes-umls']", "['infores:drugbank']", "['infores:fma-umls']", "['infores:go']", "['infores:hcp-codes-umls']", "['infores:hcpcs-cpt-umls']", "['infores:hgnc']", "['infores:hl7-umls']", "['infores:hpo']", "['infores:icd10-umls']", "['infores:icd10ae-umls']", "['infores:icd10cm-umls']", "['infores:icd10pcs-umls']", "['infores:icd9cm-umls']", "['infores:loinc-umls']", "['infores:medrt-umls']", "['infores:meddra-umls']", "['infores:medlineplus']", "['infores:mesh']", "['infores:umls-metathesaurus']", "['infores:ncbi-taxonomy']", "['infores:ncit']", "['infores:nddf-umls']", "['infores:ndfrt']", "['infores:omim']", "['infores:pdq-umls']", "['infores:psy-umls']", "['infores:rxnorm']", "['infores:snomedct']", "['infores:vandf-umls']", "['infores:umls']"]) and not (n.description contains "STY") return n.provided_by, count(n) order by count(n) desc

n.provided_by count(n)
"['infores:umls']" 198443
"['infores:hpo']" 1720
"['infores:drugbank']" 879
"['infores:loinc-umls']" 139
"['infores:mesh']" 36
"['infores:atc-codes-umls']" 1
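
The long `provided_by` list in the query above is tedious to maintain by hand. One option, sketched here in Python, is to generate the list and pass it as a Cypher parameter. The function name and the `$provided_by` parameter name are hypothetical; the wrapping as `['...']` follows the stringified-list form that `provided_by` takes in the KG2.8.3 results shown in this thread.

```python
from typing import Dict, List, Tuple


def build_missing_sty_query(sources: List[str]) -> Tuple[str, Dict[str, List[str]]]:
    """Return (query text, parameters) for counting nodes whose description lacks "STY"."""
    # provided_by is stored as a stringified list, e.g. "['infores:umls']".
    wrapped = [f"['{source}']" for source in sources]
    query = (
        "match (n) where n.provided_by in $provided_by "
        'and not (n.description contains "STY") '
        "return n.provided_by, count(n) order by count(n) desc"
    )
    return query, {"provided_by": wrapped}


query, params = build_missing_sty_query(["infores:umls", "infores:hpo"])
```

The returned pair can be handed to whichever Neo4j client is in use; keeping the source list in one place also means the per-source and per-category queries stay in sync.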

ecwood commented 1 year ago

In order to get some sample CURIEs, I ran:

match (n) where (n.provided_by="['infores:hpo']") and not (n.description contains "STY") return n.id, n.name, n.description limit 10

I chose infores:hpo because it's easy to identify the TTL file for this source and because most of its affected nodes aren't biolink:InformationContentEntity nodes.

Here are the results:

n.id n.name n.description
"MAXO:0000555" "interleukin-1 alpha biomarker measurement" "Detection of interleukin-1 alpha, a mediator of the inflammatory response."
"MAXO:0000558" "interleukin-12 biomarker measurement" "Detection of interleukin-12 levels, an inflammatory cytokine."
"MAXO:0000559" "tumor necrosis factor-alpha biomarker measurement" "Detection of TNF-alpha levels, a cytokine involved in systemic inflammation."
"MPATH:515" "non-Lymphoid neoplasias" "Hematological neoplasias of non-lymphoid origin."
"MAXO:0000520" "obstetric ultrasonography" "Use of medical ultrasonography in pregnancy where sound waves are used to create a real-time visual image of the developing fetus in the uterus. Imaging can include the mother's ovaries and uterus as well."
"MAXO:0000529" "prenatal genetic testing" "Testing of fetal DNA during pregnancy to determine if the fetus has chromosomal aberrations, fetal aneuploidy, or other detectable genetic disorders."
"MPATH:502" "monocytic leukaemia" "Leukaemia in which neoplastic cells are poorly or moderately differentiated with a monocytic but no neutrophilic component. At least 20% of the cells must be blasts."
"MAXO:0000527" "physical examination" "A systemic evaluation of the body and its functions using visual inspection, palpation, percussion and auscultation. The purpose is to determine the presence or absence of physical signs of disease or abnormality for an individual's health assessment."
"MAXO:0000528" "prenatal examination" "A test or diagnostic examination to assess the health status of the mother and well being of the fetus."
"MAXO:0000526" "clinical examination" "A direct assessment of a patient's condition by a clinical health professional that is based on a physical exam, medical history, and the patient's account of symptoms."

Here's how I found out the category information:

match (n) where (n.provided_by in ["['infores:atc-codes-umls']", "['infores:cpt-codes-umls']", "['infores:drugbank']", "['infores:fma-umls']", "['infores:go']", "['infores:hcp-codes-umls']", "['infores:hcpcs-cpt-umls']", "['infores:hgnc']", "['infores:hl7-umls']", "['infores:hpo']", "['infores:icd10-umls']", "['infores:icd10ae-umls']", "['infores:icd10cm-umls']", "['infores:icd10pcs-umls']", "['infores:icd9cm-umls']", "['infores:loinc-umls']", "['infores:medrt-umls']", "['infores:meddra-umls']", "['infores:medlineplus']", "['infores:mesh']", "['infores:umls-metathesaurus']", "['infores:ncbi-taxonomy']", "['infores:ncit']", "['infores:nddf-umls']", "['infores:ndfrt']", "['infores:omim']", "['infores:pdq-umls']", "['infores:psy-umls']", "['infores:rxnorm']", "['infores:snomedct']", "['infores:vandf-umls']", "['infores:umls']"]) and not (n.description contains "STY") return n.category, n.provided_by, count(n) order by count(n) desc

n.category n.provided_by count(n)
"biolink:PhysiologicalProcess" "['infores:umls']" 31261
"biolink:MolecularActivity" "['infores:umls']" 26624
"biolink:Disease" "['infores:umls']" 22273
"biolink:DiseaseOrPhenotypicFeature" "['infores:umls']" 19000
"biolink:ChemicalEntity" "['infores:umls']" 18970
"biolink:Publication" "['infores:umls']" 18074
"biolink:InformationContentEntity" "['infores:umls']" 9282
"biolink:NamedThing" "['infores:umls']" 8521
"biolink:Procedure" "['infores:umls']" 6946
"biolink:CellularComponent" "['infores:umls']" 6101
"biolink:OrganismTaxon" "['infores:umls']" 5420
"biolink:Phenomenon" "['infores:umls']" 3429
"biolink:Activity" "['infores:umls']" 2684
"biolink:Polypeptide" "['infores:umls']" 2326
"biolink:Drug" "['infores:umls']" 2191
"biolink:GrossAnatomicalStructure" "['infores:umls']" 2187
"biolink:Device" "['infores:umls']" 1672
"biolink:PathologicalProcess" "['infores:umls']" 1622
"biolink:BiologicalEntity" "['infores:umls']" 1471
"biolink:Cell" "['infores:umls']" 1313
"biolink:AnatomicalEntity" "['infores:umls']" 1064
"biolink:Behavior" "['infores:umls']" 1024
"biolink:PhysicalEntity" "['infores:umls']" 995
"biolink:PhenotypicFeature" "['infores:hpo']" 896
"biolink:Cohort" "['infores:umls']" 805
"biolink:SmallMolecule" "['infores:drugbank']" 773
"biolink:Agent" "['infores:umls']" 687
"biolink:NamedThing" "['infores:hpo']" 581
"biolink:PhenotypicFeature" "['infores:umls']" 569
"biolink:NucleicAcidEntity" "['infores:umls']" 503
"biolink:IndividualOrganism" "['infores:umls']" 393
"biolink:GeographicLocation" "['infores:umls']" 356
"biolink:PopulationOfIndividualOrganisms" "['infores:umls']" 258
"biolink:Food" "['infores:umls']" 218
"biolink:SmallMolecule" "['infores:umls']" 147
"biolink:InformationContentEntity" "['infores:loinc-umls']" 139
"biolink:ChemicalEntity" "['infores:drugbank']" 106
"biolink:BiologicalEntity" "['infores:hpo']" 69
"biolink:Protein" "['infores:hpo']" 66
"biolink:Event" "['infores:umls']" 54
"biolink:BehavioralFeature" "['infores:hpo']" 46
"biolink:InformationContentEntity" "['infores:mesh']" 36
"biolink:Activity" "['infores:hpo']" 32
"biolink:InformationContentEntity" "['infores:hpo']" 26
"biolink:Protein" "['infores:umls']" 3
"biolink:BiologicalProcess" "['infores:hpo']" 3
"biolink:InformationContentEntity" "['infores:atc-codes-umls']" 1
"biolink:InformationResource" "['infores:hpo']" 1
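
As a sanity check, the per-category counts above should add up to the per-source totals reported earlier for KG2.8.3. The Python sketch below does this for the non-`infores:umls` rows, which are copied verbatim from the table; the function name is hypothetical, and `shlex` is used only because it handles the quoted columns conveniently.

```python
import shlex
from collections import Counter

# Rows copied from the category breakdown above (non-"infores:umls" sources only).
ROWS = """
"biolink:PhenotypicFeature" "['infores:hpo']" 896
"biolink:SmallMolecule" "['infores:drugbank']" 773
"biolink:NamedThing" "['infores:hpo']" 581
"biolink:InformationContentEntity" "['infores:loinc-umls']" 139
"biolink:ChemicalEntity" "['infores:drugbank']" 106
"biolink:BiologicalEntity" "['infores:hpo']" 69
"biolink:Protein" "['infores:hpo']" 66
"biolink:BehavioralFeature" "['infores:hpo']" 46
"biolink:InformationContentEntity" "['infores:mesh']" 36
"biolink:Activity" "['infores:hpo']" 32
"biolink:InformationContentEntity" "['infores:hpo']" 26
"biolink:BiologicalProcess" "['infores:hpo']" 3
"biolink:InformationContentEntity" "['infores:atc-codes-umls']" 1
"biolink:InformationResource" "['infores:hpo']" 1
"""


def totals_by_source(text: str) -> Counter:
    """Sum the count column per provided_by value."""
    totals: Counter = Counter()
    for line in text.strip().splitlines():
        _category, provided_by, count = shlex.split(line)
        totals[provided_by] += int(count)
    return totals


totals = totals_by_source(ROWS)
```

The resulting totals (1720 for `infores:hpo`, 879 for `infores:drugbank`, 139 for `infores:loinc-umls`, 36 for `infores:mesh`, 1 for `infores:atc-codes-umls`) agree with the per-source query, so the two result sets are consistent.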