Open ecwood opened 3 years ago
Looking into where they're coming from using
`match (n) where not (n)-[]-() return distinct(n.provided_by), count(*) order by count(*) DESC`
on KG2.6.4:
(n.provided_by) | count(*) |
---|---|
"identifiers_org_registry:chembl.compound" | 1928491 |
"OBO:chebi.owl" | 18389 |
"umls_source:MTH" | 8227 |
"OBO:pr.owl" | 5263 |
"OBO:go/extensions/go-plus.owl" | 4959 |
"OBO:doid.owl" | 2183 |
"OBO:foodon.owl" | 2065 |
"OBO:mondo.owl" | 1988 |
"OBO:uberon/ext.owl" | 1563 |
"PathWhiz:" | 1240 |
"identifiers_org_registry:uniprot" | 1191 |
"OBO:pato.owl" | 989 |
"EFO:efo.owl" | 747 |
"ORPHANET:" | 618 |
"umls_source:HL7" | 484 |
"umls_source:MED-RT" | 353 |
"identifiers_org_registry:hmdb" | 314 |
"umls_source:NCI" | 272 |
"OBO:hp.owl" | 265 |
"umls_source:SNOMEDCT" | 222 |
"OBO:mi.owl" | 196 |
"umls_source:FMA" | 162 |
"umls_source:LNC" | 159 |
"OBO:ncbitaxon/subsets/taxslim.owl" | 137 |
"UMLS_STY:" | 129 |
"OBO:ro.owl" | 83 |
"umls_source:CPT" | 59 |
"umls_source:RXNORM" | 51 |
"DrugCentral:" | 49 |
"umls_source:MSH" | 38 |
"SEMMEDDB:" | 38 |
"umls_source:GO" | 27 |
"umls_source:HGNC" | 27 |
"biolink_download_source:biolink-model.owl.ttl" | 25 |
"OBO:bspo.owl" | 21 |
"umls_source:HCPCS" | 20 |
"umls_source:HCPT" | 20 |
"umls_source:VANDF" | 20 |
"OBO:ino.owl" | 19 |
"umls_source:PDQ" | 17 |
"OBO:bfo.owl" | 16 |
"identifiers_org_registry:reactome" | 16 |
"umls_source:OMIM" | 15 |
"umls_source:MEDDRA" | 14 |
"umls_source:MEDLINEPLUS" | 12 |
"umls_source:ICD10CM" | 9 |
"umls_source:ICD9CM" | 8 |
"umls_source:PSY" | 8 |
"umls_source:HPO" | 7 |
"umls_source:NDDF" | 6 |
"OBO:ddanat.owl" | 5 |
"OBO:cl.owl" | 5 |
"umls_source:ATC" | 4 |
"umls_source:ICD10PCS" | 4 |
"OBO:ehdaa2.owl" | 4 |
"umls_source:DRUGBANK" | 3 |
"umls_source:NCBITAXON" | 3 |
"OBO:nbo.owl" | 3 |
Maybe we could amend the reporting Python script to provide info on orphan nodes, i.e., counts by prefix and/or counts by source.
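A rough sketch of what that could look like, not taken from the actual report script. It assumes the nodes and edges are already loaded as lists of dicts, that each node has an `id` (a CURIE) and a string-valued `provided_by`, and that each edge has `subject` and `object`. The function name `orphan_node_counts` is made up for illustration.

```python
from collections import Counter


def orphan_node_counts(nodes, edges):
    """Count nodes with no edges, grouped by CURIE prefix and by provided_by.

    Assumed (hypothetical) input shape: nodes are dicts with "id" and
    "provided_by"; edges are dicts with "subject" and "object".
    """
    # Every node ID that is the subject or object of at least one edge.
    connected = set()
    for edge in edges:
        connected.add(edge["subject"])
        connected.add(edge["object"])

    by_prefix = Counter()
    by_source = Counter()
    for node in nodes:
        if node["id"] in connected:
            continue
        # The CURIE prefix is everything before the first colon.
        by_prefix[node["id"].split(":", 1)[0]] += 1
        by_source[node.get("provided_by")] += 1
    return by_prefix, by_source
```

`by_source.most_common()` would then give the same ordering as the Cypher query above (largest source first), and `by_prefix.most_common()` gives the per-prefix breakdown.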
On KG2.6.4, `match (n) where not (n)--() return count(n)` returns 1981277, which is about 18.6% of all nodes. On KG2.6.3, `match (n) where not (n)--() return count(n)` returns 1935380, which is about 31% of all nodes. It would be good if the report script could document this each build. It would also be good to know where these nodes are coming from.
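For the per-build number, something along these lines might be enough. This is only a sketch under assumptions: it takes an iterable of node IDs plus edges as dicts with `subject` and `object`, and the function name and the keys of the returned dict are hypothetical, not fields the existing report is known to have.

```python
def orphan_node_summary(node_ids, edges):
    """Return the total node count, orphan count, and orphan fraction.

    An orphan here is a node that is neither the subject nor the object of
    any edge, matching `match (n) where not (n)--() return count(n)`.
    """
    connected = set()
    for edge in edges:
        connected.add(edge["subject"])
        connected.add(edge["object"])

    all_ids = set(node_ids)
    orphans = all_ids - connected
    total = len(all_ids)
    return {
        "number_of_nodes": total,
        "number_of_orphan_nodes": len(orphans),
        # Guard against an empty graph to avoid dividing by zero.
        "fraction_of_orphan_nodes": len(orphans) / total if total else 0.0,
    }
```

Writing this dict into the report for every build would make jumps in the orphan fraction between versions easy to spot.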