Open saramsey opened 3 years ago
see full text of Will Byrd's email in #56
From KG2.6.7:
match (n) where (split(n.provided_by, ':')[0]='umls_source' or n.provided_by="identifiers_org_registry:umls") and not (n.description contains "UMLS_STY") return count(n)
count(n) |
---|
186647 |
match (n) where (split(n.provided_by, ':')[0]='umls_source' or n.provided_by="identifiers_org_registry:umls") and not (n.description contains "UMLS_STY") return n.category, n.provided_by, count(n)
n.category | n.provided_by | count(n) |
---|---|---|
"biolink:MolecularEntity" | "identifiers_org_registry:umls" | 20954 |
"biolink:ChemicalSubstance" | "identifiers_org_registry:umls" | 6376 |
"biolink:Drug" | "identifiers_org_registry:umls" | 2178 |
"biolink:InformationContentEntity" | "umls_source:ATC" | 3 |
"biolink:NamedThing" | "identifiers_org_registry:umls" | 1483 |
"biolink:IndividualOrganism" | "identifiers_org_registry:umls" | 6883 |
"biolink:AnatomicalEntity" | "identifiers_org_registry:umls" | 926 |
"biolink:InformationContentEntity" | "umls_source:DRUGBANK" | 2 |
"biolink:Protein" | "identifiers_org_registry:umls" | 896 |
"biolink:GrossAnatomicalStructure" | "identifiers_org_registry:umls" | 2214 |
"biolink:CellularComponent" | "identifiers_org_registry:umls" | 6060 |
"biolink:InformationContentEntity" | "umls_source:FMA" | 164 |
"biolink:Cell" | "identifiers_org_registry:umls" | 1059 |
"biolink:InformationContentEntity" | "identifiers_org_registry:umls" | 24006 |
"biolink:PhysiologicalProcess" | "identifiers_org_registry:umls" | 32797 |
"biolink:Disease" | "identifiers_org_registry:umls" | 19689 |
"biolink:GenomicEntity" | "identifiers_org_registry:umls" | 769 |
"biolink:DiseaseOrPhenotypicFeature" | "identifiers_org_registry:umls" | 17469 |
"biolink:MolecularActivity" | "identifiers_org_registry:umls" | 26543 |
"biolink:Phenomenon" | "identifiers_org_registry:umls" | 2946 |
"biolink:Activity" | "identifiers_org_registry:umls" | 1513 |
"biolink:PathologicalProcess" | "identifiers_org_registry:umls" | 1365 |
"biolink:InformationContentEntity" | "umls_source:GO" | 26 |
"biolink:Procedure" | "identifiers_org_registry:umls" | 7553 |
"biolink:Device" | "identifiers_org_registry:umls" | 697 |
"biolink:InformationContentEntity" | "umls_source:HCPCS" | 19 |
"biolink:InformationContentEntity" | "umls_source:HGNC" | 26 |
"biolink:InformationContentEntity" | "umls_source:HL7" | 59 |
"biolink:InformationContentEntity" | "umls_source:HPO" | 6 |
"biolink:PopulationOfIndividualOrganisms" | "identifiers_org_registry:umls" | 244 |
"biolink:InformationContentEntity" | "umls_source:ICD10PCS" | 3 |
"biolink:InformationContentEntity" | "umls_source:ICD9CM" | 7 |
"biolink:GeographicLocation" | "identifiers_org_registry:umls" | 341 |
"biolink:InformationContentEntity" | "umls_source:LNC" | 158 |
"biolink:Agent" | "identifiers_org_registry:umls" | 626 |
"biolink:InformationContentEntity" | "umls_source:MEDLINEPLUS" | 10 |
"biolink:InformationContentEntity" | "umls_source:MED-RT" | 6 |
"biolink:BiologicalEntity" | "identifiers_org_registry:umls" | 123 |
"biolink:InformationContentEntity" | "umls_source:MSH" | 37 |
"biolink:Carbohydrate" | "identifiers_org_registry:umls" | 1 |
"biolink:InformationContentEntity" | "umls_source:NCBITAXON" | 2 |
"biolink:InformationContentEntity" | "umls_source:NCI" | 271 |
"biolink:InformationContentEntity" | "umls_source:NDDF" | 5 |
"biolink:NamedThing" | "umls_source:OMIM" | 14 |
"biolink:InformationContentEntity" | "umls_source:PDQ" | 16 |
"biolink:InformationContentEntity" | "umls_source:PSY" | 7 |
"biolink:InformationContentEntity" | "umls_source:RXNORM" | 50 |
"biolink:InformationContentEntity" | "umls_source:VANDF" | 19 |
"biolink:InformationContentEntity" | "umls_source:MTH" | 26 |
match (n) where (n.provided_by="identifiers_org_registry:umls") return not (n.description contains "UMLS_STY"), split(n.id, ':')[0], count(n)
not (n.description contains "UMLS_STY") | split(n.id, ':')[0] | count(n) |
---|---|---|
null | "UMLS" | 2785105 |
false | "UMLS" | 157507 |
true | "UMLS" | 185711 |
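The `null` row above most likely corresponds to nodes whose `description` is null: in Cypher, `CONTAINS` applied to null evaluates to null, `not null` is still null, and a `where` clause keeps a row only when its predicate is true. That would mean the earlier counts silently exclude those nodes. It is also consistent with the numbers: the `true` count of 185711 plus the 936 `umls_source:*` nodes from the per-category table gives the 186647 reported by the first query. Below is a minimal Python sketch of that three-valued behaviour. The function names and sample descriptions are made up for illustration and are not KG2 data.

```python
from typing import Optional

MARKER = "UMLS_STY"


def not_contains(description: Optional[str], needle: str = MARKER) -> Optional[bool]:
    """Mimic Cypher's `not (n.description contains needle)` with null propagation."""
    if description is None:
        # null CONTAINS ... is null, and NOT null is still null
        return None
    return needle not in description


def where_keeps(description: Optional[str]) -> bool:
    """A WHERE clause keeps a row only when the predicate is exactly true."""
    return not_contains(description) is True


# Hypothetical descriptions, for illustration only
samples = [None, "UMLS_STY:T184; some text", "some text with no marker"]
kept = [d for d in samples if where_keeps(d)]
print(kept)  # ['some text with no marker']
```

If nodes with no description at all should also be counted as missing the semantic type, the query would need an explicit null check on `n.description` alongside the `contains` test.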
This is important for implementing #86.
This is definitely still an issue as of KG2.8.3:
match (n) where (n.provided_by in ["['infores:atc-codes-umls']", "['infores:cpt-codes-umls']", "['infores:drugbank']", "['infores:fma-umls']", "['infores:go']", "['infores:hcp-codes-umls']", "['infores:hcpcs-cpt-umls']", "['infores:hgnc']", "['infores:hl7-umls']", "['infores:hpo']", "['infores:icd10-umls']", "['infores:icd10ae-umls']", "['infores:icd10cm-umls']", "['infores:icd10pcs-umls']", "['infores:icd9cm-umls']", "['infores:loinc-umls']", "['infores:medrt-umls']", "['infores:meddra-umls']", "['infores:medlineplus']", "['infores:mesh']", "['infores:umls-metathesaurus']", "['infores:ncbi-taxonomy']", "['infores:ncit']", "['infores:nddf-umls']", "['infores:ndfrt']", "['infores:omim']", "['infores:pdq-umls']", "['infores:psy-umls']", "['infores:rxnorm']", "['infores:snomedct']", "['infores:vandf-umls']", "['infores:umls']"]) and not (n.description contains "STY") return n.provided_by, count(n) order by count(n) desc
n.provided_by | count(n) |
---|---|
"['infores:umls']" | 198443 |
"['infores:hpo']" | 1720 |
"['infores:drugbank']" | 879 |
"['infores:loinc-umls']" | 139 |
"['infores:mesh']" | 36 |
"['infores:atc-codes-umls']" | 1 |
To get some sample CURIEs, I ran:
match (n) where (n.provided_by="['infores:hpo']") and not (n.description contains "STY") return n.id, n.name, n.description limit 10
I chose `infores:hpo` because it's easy to identify the TTL file for this source, and most of its affected nodes aren't `biolink:InformationContentEntity` nodes.
Here are the results:

n.id | n.name | n.description |
---|---|---|
"MAXO:0000555" | "interleukin-1 alpha biomarker measurement" | "Detection of interleukin-1 alpha, a mediator of the inflammatory response." |
"MAXO:0000558" | "interleukin-12 biomarker measurement" | "Detection of interleukin-12 levels, an inflammatory cytokine." |
"MAXO:0000559" | "tumor necrosis factor-alpha biomarker measurement" | "Detection of TNF-alpha levels, a cytokine involved in systemic inflammation." |
"MPATH:515" | "non-Lymphoid neoplasias" | "Hematological neoplasias of non-lymphoid origin." |
"MAXO:0000520" | "obstetric ultrasonography" | "Use of medical ultrasonography in pregnancy where sound waves are used to create a real-time visual image of the developing fetus in the uterus. Imaging can include the mother's ovaries and uterus as well." |
"MAXO:0000529" | "prenatal genetic testing" | "Testing of fetal DNA during pregnancy to determine if the fetus has chromosomal aberrations, fetal aneuploidy, or other detectable genetic disorders." |
"MPATH:502" | "monocytic leukaemia" | "Leukaemia in which neoplastic cells are poorly or moderately differentiated with a monocytic but no neutrophilic component. At least 20% of the cells must be blasts." |
"MAXO:0000527" | "physical examination" | "A systemic evaluation of the body and its functions using visual inspection, palpation, percussion and auscultation. The purpose is to determine the presence or absence of physical signs of disease or abnormality for an individual's health assessment." |
"MAXO:0000528" | "prenatal examination" | "A test or diagnostic examination to assess the health status of the mother and well being of the fetus." |
"MAXO:0000526" | "clinical examination" | "A direct assessment of a patient's condition by a clinical health professional that is based on a physical exam, medical history, and the patient's account of symptoms." |
Here's how I found out the category information:
match (n) where (n.provided_by in ["['infores:atc-codes-umls']", "['infores:cpt-codes-umls']", "['infores:drugbank']", "['infores:fma-umls']", "['infores:go']", "['infores:hcp-codes-umls']", "['infores:hcpcs-cpt-umls']", "['infores:hgnc']", "['infores:hl7-umls']", "['infores:hpo']", "['infores:icd10-umls']", "['infores:icd10ae-umls']", "['infores:icd10cm-umls']", "['infores:icd10pcs-umls']", "['infores:icd9cm-umls']", "['infores:loinc-umls']", "['infores:medrt-umls']", "['infores:meddra-umls']", "['infores:medlineplus']", "['infores:mesh']", "['infores:umls-metathesaurus']", "['infores:ncbi-taxonomy']", "['infores:ncit']", "['infores:nddf-umls']", "['infores:ndfrt']", "['infores:omim']", "['infores:pdq-umls']", "['infores:psy-umls']", "['infores:rxnorm']", "['infores:snomedct']", "['infores:vandf-umls']", "['infores:umls']"]) and not (n.description contains "STY") return n.category, n.provided_by, count(n) order by count(n) desc
n.category | n.provided_by | count(n) |
---|---|---|
"biolink:PhysiologicalProcess" | "['infores:umls']" | 31261 |
"biolink:MolecularActivity" | "['infores:umls']" | 26624 |
"biolink:Disease" | "['infores:umls']" | 22273 |
"biolink:DiseaseOrPhenotypicFeature" | "['infores:umls']" | 19000 |
"biolink:ChemicalEntity" | "['infores:umls']" | 18970 |
"biolink:Publication" | "['infores:umls']" | 18074 |
"biolink:InformationContentEntity" | "['infores:umls']" | 9282 |
"biolink:NamedThing" | "['infores:umls']" | 8521 |
"biolink:Procedure" | "['infores:umls']" | 6946 |
"biolink:CellularComponent" | "['infores:umls']" | 6101 |
"biolink:OrganismTaxon" | "['infores:umls']" | 5420 |
"biolink:Phenomenon" | "['infores:umls']" | 3429 |
"biolink:Activity" | "['infores:umls']" | 2684 |
"biolink:Polypeptide" | "['infores:umls']" | 2326 |
"biolink:Drug" | "['infores:umls']" | 2191 |
"biolink:GrossAnatomicalStructure" | "['infores:umls']" | 2187 |
"biolink:Device" | "['infores:umls']" | 1672 |
"biolink:PathologicalProcess" | "['infores:umls']" | 1622 |
"biolink:BiologicalEntity" | "['infores:umls']" | 1471 |
"biolink:Cell" | "['infores:umls']" | 1313 |
"biolink:AnatomicalEntity" | "['infores:umls']" | 1064 |
"biolink:Behavior" | "['infores:umls']" | 1024 |
"biolink:PhysicalEntity" | "['infores:umls']" | 995 |
"biolink:PhenotypicFeature" | "['infores:hpo']" | 896 |
"biolink:Cohort" | "['infores:umls']" | 805 |
"biolink:SmallMolecule" | "['infores:drugbank']" | 773 |
"biolink:Agent" | "['infores:umls']" | 687 |
"biolink:NamedThing" | "['infores:hpo']" | 581 |
"biolink:PhenotypicFeature" | "['infores:umls']" | 569 |
"biolink:NucleicAcidEntity" | "['infores:umls']" | 503 |
"biolink:IndividualOrganism" | "['infores:umls']" | 393 |
"biolink:GeographicLocation" | "['infores:umls']" | 356 |
"biolink:PopulationOfIndividualOrganisms" | "['infores:umls']" | 258 |
"biolink:Food" | "['infores:umls']" | 218 |
"biolink:SmallMolecule" | "['infores:umls']" | 147 |
"biolink:InformationContentEntity" | "['infores:loinc-umls']" | 139 |
"biolink:ChemicalEntity" | "['infores:drugbank']" | 106 |
"biolink:BiologicalEntity" | "['infores:hpo']" | 69 |
"biolink:Protein" | "['infores:hpo']" | 66 |
"biolink:Event" | "['infores:umls']" | 54 |
"biolink:BehavioralFeature" | "['infores:hpo']" | 46 |
"biolink:InformationContentEntity" | "['infores:mesh']" | 36 |
"biolink:Activity" | "['infores:hpo']" | 32 |
"biolink:InformationContentEntity" | "['infores:hpo']" | 26 |
"biolink:Protein" | "['infores:umls']" | 3 |
"biolink:BiologicalProcess" | "['infores:hpo']" | 3 |
"biolink:InformationContentEntity" | "['infores:atc-codes-umls']" | 1 |
"biolink:InformationResource" | "['infores:hpo']" | 1 |
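As a sanity check, the per-category rows should roll up to the per-source totals reported for KG2.8.3 (1720 for `infores:hpo`, 879 for `infores:drugbank`). The sketch below re-adds only the `hpo` and `drugbank` rows copied from the table above. The `provided_by` values are written without the surrounding `['...']` for readability, and the variable names are illustrative.

```python
from collections import defaultdict

# (category, source, count) rows copied from the table above
rows = [
    ("biolink:PhenotypicFeature", "infores:hpo", 896),
    ("biolink:SmallMolecule", "infores:drugbank", 773),
    ("biolink:NamedThing", "infores:hpo", 581),
    ("biolink:ChemicalEntity", "infores:drugbank", 106),
    ("biolink:BiologicalEntity", "infores:hpo", 69),
    ("biolink:Protein", "infores:hpo", 66),
    ("biolink:BehavioralFeature", "infores:hpo", 46),
    ("biolink:Activity", "infores:hpo", 32),
    ("biolink:InformationContentEntity", "infores:hpo", 26),
    ("biolink:BiologicalProcess", "infores:hpo", 3),
    ("biolink:InformationResource", "infores:hpo", 1),
]

# Sum the counts per source
totals = defaultdict(int)
for category, source, count in rows:
    totals[source] += count

# Count of hpo rows categorized as InformationContentEntity
hpo_ice = sum(
    count
    for category, source, count in rows
    if source == "infores:hpo" and category == "biolink:InformationContentEntity"
)

print(dict(totals))  # {'infores:hpo': 1720, 'infores:drugbank': 879}
print(hpo_ice)       # 26
```

Both totals match the per-source table, and only 26 of the 1720 `infores:hpo` nodes are `biolink:InformationContentEntity`.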
Thank you to Will Byrd for reporting this issue.
For many UMLS nodes in KG2, we include the semantic type (TUI) in the `description` field. But for some, we do not. For example, the Cypher query shows that for "headache", the description field includes the TUI, as expected. But for the Cypher query the result for "Cerebral Palsy" does not include the TUI in the description field. Why is that? (The subtext here is that Team Unsecret Agent in some cases uses the TUI information for KG2 UMLS nodes, so if we can provide it, that would be helpful to them.)
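For a consumer that relies on the TUI, a defensive way to read it out of the description might look like the sketch below. It assumes the marker appears as `UMLS_STY:` followed directly by a TUI (a `T` and three digits). That layout is an assumption, not something confirmed in this thread, and the function name and sample strings are hypothetical.

```python
import re
from typing import List, Optional

# Assumed marker layout: "UMLS_STY:" followed by a TUI such as T047.
# The exact delimiter used in KG2 descriptions is an assumption here.
STY_PATTERN = re.compile(r"UMLS_STY:(T\d{3})")


def extract_tuis(description: Optional[str]) -> List[str]:
    """Return every TUI found in a description, or an empty list if there is none."""
    if not description:
        return []
    return STY_PATTERN.findall(description)


# Made-up descriptions, for illustration only
print(extract_tuis("UMLS_STY:T184; pain in the head"))  # ['T184']
print(extract_tuis("a description with no semantic type"))  # []
print(extract_tuis(None))  # []
```

Returning an empty list rather than raising lets a caller treat "no description" and "description without a TUI" the same way, which are the two cases this issue is about.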