RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

A Lot of Nodes in KG2 and KG2c are Orphan Nodes #58

Open ecwood opened 3 years ago

ecwood commented 3 years ago

Running

match (n) where not (n)--() return count(n)

on KG2.6.4 returns 1981277, which is about 18.6% of all nodes. On KG2.6.3, the same query returns 1935380, which is about 31% of all nodes.

It would be good if the report script could record this count for each build. It would also be good to know where these nodes are coming from.

kvarforl commented 3 years ago

Looking into where they're coming from, using:

match (n) where not (n)-[]-() return distinct(n.provided_by), count(*) order by count(*) DESC

KG2.6.4:

(n.provided_by) count(*)
"identifiers_org_registry:chembl.compound" 1928491
"OBO:chebi.owl" 18389
"umls_source:MTH" 8227
"OBO:pr.owl" 5263
"OBO:go/extensions/go-plus.owl" 4959
"OBO:doid.owl" 2183
"OBO:foodon.owl" 2065
"OBO:mondo.owl" 1988
"OBO:uberon/ext.owl" 1563
"PathWhiz:" 1240
"identifiers_org_registry:uniprot" 1191
"OBO:pato.owl" 989
"EFO:efo.owl" 747
"ORPHANET:" 618
"umls_source:HL7" 484
"umls_source:MED-RT" 353
"identifiers_org_registry:hmdb" 314
"umls_source:NCI" 272
"OBO:hp.owl" 265
"umls_source:SNOMEDCT" 222
"OBO:mi.owl" 196
"umls_source:FMA" 162
"umls_source:LNC" 159
"OBO:ncbitaxon/subsets/taxslim.owl" 137
"UMLS_STY:" 129
"OBO:ro.owl" 83
"umls_source:CPT" 59
"umls_source:RXNORM" 51
"DrugCentral:" 49
"umls_source:MSH" 38
"SEMMEDDB:" 38
"umls_source:GO" 27
"umls_source:HGNC" 27
"biolink_download_source:biolink-model.owl.ttl" 25
"OBO:bspo.owl" 21
"umls_source:HCPCS" 20
"umls_source:HCPT" 20
"umls_source:VANDF" 20
"OBO:ino.owl" 19
"umls_source:PDQ" 17
"OBO:bfo.owl" 16
"identifiers_org_registry:reactome" 16
"umls_source:OMIM" 15
"umls_source:MEDDRA" 14
"umls_source:MEDLINEPLUS" 12
"umls_source:ICD10CM" 9
"umls_source:ICD9CM" 8
"umls_source:PSY" 8
"umls_source:HPO" 7
"umls_source:NDDF" 6
"OBO:ddanat.owl" 5
"OBO:cl.owl" 5
"umls_source:ATC" 4
"umls_source:ICD10PCS" 4
"OBO:ehdaa2.owl" 4
"umls_source:DRUGBANK" 3
"umls_source:NCBITAXON" 3
"OBO:nbo.owl" 3

saramsey commented 3 years ago

Maybe we could amend the reporting Python script to report on orphan nodes, i.e., counts by prefix and/or counts by source.
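
A minimal sketch of what such an orphan-node report could look like, assuming the graph is available as a list of node dicts and a list of edge dicts. The function name orphan_report and the field names id, provided_by, subject, and object are assumptions for illustration, not confirmed parts of the existing report script; provided_by is assumed to be a single string, as in the Cypher output above.

```python
from collections import Counter


def orphan_report(nodes, edges):
    """Count nodes that appear in no edge, grouped by source and by ID prefix.

    Hypothetical helper: assumes each node has "id" and "provided_by" keys
    and each edge has "subject" and "object" keys.
    """
    # Any node ID that appears at either end of an edge is connected.
    connected = set()
    for edge in edges:
        connected.add(edge["subject"])
        connected.add(edge["object"])

    orphans = [node for node in nodes if node["id"] not in connected]

    # Group orphans by their source and by the part of the ID before ":".
    by_source = Counter(node.get("provided_by", "") for node in orphans)
    by_prefix = Counter(node["id"].split(":", 1)[0] for node in orphans)

    return {
        "total_nodes": len(nodes),
        "orphan_count": len(orphans),
        "orphan_fraction": len(orphans) / len(nodes) if nodes else 0.0,
        "by_source": by_source.most_common(),
        "by_prefix": by_prefix.most_common(),
    }


if __name__ == "__main__":
    # Toy data for illustration only; not taken from any KG2 build.
    demo_nodes = [
        {"id": "CHEMBL.COMPOUND:CHEMBL1", "provided_by": "identifiers_org_registry:chembl.compound"},
        {"id": "CHEMBL.COMPOUND:CHEMBL2", "provided_by": "identifiers_org_registry:chembl.compound"},
        {"id": "CHEBI:15377", "provided_by": "OBO:chebi.owl"},
        {"id": "GO:0008150", "provided_by": "OBO:go/extensions/go-plus.owl"},
    ]
    demo_edges = [{"subject": "CHEBI:15377", "object": "GO:0008150"}]
    print(orphan_report(demo_nodes, demo_edges))
```

Like the Cypher query used above, this treats a node as an orphan only if it appears in no edge at all, so a node whose only edge is a self-loop is counted as connected. Grouping by both source and prefix in one pass keeps the report cheap to run on each build.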